ABSTRACT

A computationally efficient estimator for the Bayes risk is one which
achieves a desired accuracy with a minimum of computation. In many
problems, for example speech recognition, point evaluations of the class
conditional densities are computationally costly. Density evaluations are
the single most important factor contributing to the computational effort
in Bayes risk estimation; thus the amount of computation required by a
Bayes risk estimator is defined as the average number of conditional
density evaluations it performs. The accuracy of a risk estimator is
defined by its variance.


Existing estimators for the Bayes risk, namely the error count estimator
and the posterior estimator, require for each sample $X_j$,
$j = 1, 2, \ldots, N$, evaluation of the class conditional density
$f_m(X_j)$ for each class $m = 1, 2, \ldots, M$, a total of $N \cdot M$
density evaluations. For problems such as speech recognition, where the
number of classes $M$ is large and density evaluations costly, these
estimators are impractical from a computational aspect.
A new class of estimators of the general form $\hat{R}(T)$ is proposed. An
estimator $\hat{R}(T)$ is defined by associating with each class $m$ a
subset $T_m$ of the $M$ classes. For two classes, only the error count and
posterior estimators belong to this class. For more than two classes,
several new estimators for the Bayes risk are included.


Estimators requiring fewer density evaluations are derived from the
class of estimators of the general form $\hat{R}(T)$ as follows. A scalar
parameter $\alpha$ determines the sets $T_m(\alpha)$ of classes that are
"$\alpha$-close" to each class $m$, hence an estimator $\hat{R}(\alpha)$ of
the general form $\hat{R}(T)$. As $\alpha$ varies, the sets
$T_1(\alpha), \ldots, T_M(\alpha)$ vary and a family
$\{\hat{R}(\alpha) : 0 \le \alpha < \alpha_{\max}\}$ of risk estimators is
achieved. Each estimator in the family is characterized by the average
number of density evaluations it requires and its variance.

The optimal estimator from the family
$\{\hat{R}(\alpha) : 0 \le \alpha < \alpha_{\max}\}$ is defined as that
estimator with maximum computational efficiency, where the computational
efficiency of an estimator is the inverse of the product of the average
number of density evaluations it requires and its variance. The optimal
estimator requires the least amount of computation to achieve a given
accuracy, or, symmetrically, achieves the greatest accuracy for a given
amount of computation.

In practice, the true optimal estimator cannot be determined since
this would in effect require knowledge of the true risk $R$. Thus a
technique whereby the first $n$ of the total $N$ samples are used to
approximate the optimal estimator is proposed. The $n$ samples should
contain enough information on the closeness of the classes to determine an
almost optimal estimator. The last $N - n$ samples are used in the
approximate optimal estimator to obtain an accurate estimate of the risk
with a minimum of computation.

UNCLASSIFIED

REPORT DOCUMENTATION PAGE

Title: COMPUTATIONALLY EFFICIENT ESTIMATORS FOR THE BAYES RISK
Authors: Lynn D. Wilcox and Rui J.P. de Figueiredo
Type of report: Interim
Performing organization report number: EE73C4
Performing organization: Rice University, Department of Electrical
    Engineering, Houston, Texas 77001
Program element, project, task area and work unit numbers: 61102F 230VA2
Controlling office: Air Force Office of Scientific Research/NM,
    Bolling AFB, Washington, DC 20332
Report date: May 1978
Number of pages: 37
Contract or grant number: AFOSR 75-2777
Security class (of this report): UNCLASSIFIED
Distribution statement: Approved for public release; distribution unlimited.
Key words: Pattern recognition; Bayes risk; error estimation

Abstract:
A computationally efficient estimator for the Bayes risk is one which
achieves a desired accuracy with a minimum of computation. Existing
estimators based on error count or the risk function require, at each
sample of the test data set, point evaluation of the class conditional
density for each of the classes. In problems such as speech recognition,
where the number of classes is large and point evaluations of the densities
complex, these estimators are impractical from a computational aspect.
A new family of estimators for the Bayes risk is defined.
Computational forms for estimators in the family reduce the number
of densities that must be evaluated at each test sample. Thus
a computationally efficient estimator may be chosen from the
family.


TABLE OF CONTENTS

1. INTRODUCTION TO THE RESEARCH TOPIC
   1.1 Introduction
   1.2 Review of Previous Work
   1.3 Approach and Development in the Present Work

2. BASIC CONCEPTS ASSOCIATED WITH THE BAYES RISK

3. PROPOSED NEW ESTIMATORS FOR THE BAYES RISK
   3.1 Introduction
   3.2 Estimators Based on Unrestricted Sampling
       3.2.1 Remarks on Error Count and Posterior Estimators
       3.2.2 A General Form for Bayes Risk Estimators
       3.2.3 A Parameterized Family of Estimators for the Bayes Risk
       3.2.4 Computational Requirements for Estimators in the Family
       3.2.5 Variances of Estimators in the Family
       3.2.6 Examples
   3.3 Estimators Based on Stratified Sampling
       3.3.1 A Parameterized Family of Bayes Risk Estimators
       3.3.2 Variances of Estimators in the Family
       3.3.3 Computational Requirements for Estimators in the Family

4. OPTIMAL ESTIMATORS
   4.1 Introduction
   4.2 Computational Efficiency: A Criterion for the Optimal Estimator
   4.3 An Algorithm for Maximization of the Computational Efficiency
   4.4 Comparison of the Optimal Estimator with the Error Count and
       Posterior Estimators
   4.5 Approximation of the Optimal Estimator
   4.6 Examples

5. CONCLUSIONS
   5.1 Summary of Results
   5.2 Recommendations for Further Research

BIBLIOGRAPHY

APPENDIX
   A. Data From Example 1
   B. Data From Example 2

CHAPTER 1

INTRODUCTION TO THE RESEARCH TOPIC

1.1 Introduction

The task of a pattern recognition system is to decide to which of
$M$ classes a given pattern belongs. The decision is made on the basis of
a set of measurements $X$ taken on the pattern and is specified by the
decision rule $\delta(X)$. The performance of the system may be
characterized by the probability that it makes a classification error. The
decision rule which minimizes the probability of classification error is
called the Bayes rule and the resulting minimum probability of
classification error the Bayes risk.

The Bayes risk represents the optimal performance of a pattern
recognition system for a given set of measurements $X$. As such it may be
regarded as the intrinsic difficulty of the problem, or the confusability
of the $M$ classes. Suppose one wanted to compare the difficulty of two
speech recognition tasks. The number of words in each vocabulary would
be one criterion. However, one should also consider the confusability of
the words in each vocabulary, as measured by the Bayes risk for each
task.

In this thesis, we study estimators for the Bayes risk in terms of
the amount of computation they require and their accuracy. It is assumed
that the class conditional densities $f_1(x), \ldots, f_M(x)$ and priors
$\pi_1, \ldots, \pi_M$ are known so that attention may be focused on the
actual forms for risk estimators. The results will also apply
asymptotically if the unknown densities are estimated on training data
which is independent of the test data used in the risk estimators,
provided the density estimates are asymptotically unbiased and consistent.

In many problems, point evaluations of the class conditional densities
are computationally costly. For example, in speech recognition
[23, 1], the class conditional density $f_m(x)$ would be the probability
that the output phone string $x$ was caused by the $m$-th word in the
vocabulary. Evaluation of $f_m(x)$ involves determining all phonetic
realizations of the $m$-th word, and for each phonetic realization, all
segmentation and classification errors that would result in the output
phone string $x$.

In estimation of the Bayes risk, density evaluations are the single most
important factor contributing to the computational effort. Thus the
amount of computation required by a Bayes risk estimator is defined as
the average number of class conditional density evaluations involved in
the estimation procedure.

Existing estimators for the Bayes risk, namely the error count
estimator [6] and the posterior estimator [11], require for each sample
$X_j$, $j = 1, 2, \ldots, N$ in the test data set, evaluation of the class
conditional densities $f_1(X_j), \ldots, f_M(X_j)$, a total of $N \times M$
density evaluations. Thus for problems such as speech recognition, where
the number of classes $M$ is large and density evaluations costly, these
estimators are impractical from a computational aspect.

We propose several new estimators for the Bayes risk. In particular,
a family $\{\hat{R}(\alpha) : 0 \le \alpha < \alpha_{\max}\}$ of unbiased
and consistent risk estimators, indexed on the scalar parameter $\alpha$,
is defined. The parameter $\alpha$ determines, for each sample $X_j$, the
classes $\ell$ for which the class conditional density $f_\ell(X_j)$ must
be evaluated in forming the estimator $\hat{R}(\alpha)$. In general,
$\hat{R}(\alpha)$ may be computed with fewer density evaluations than the
$N \times M$ required by the existing estimators. Bayes risk estimators are
evaluated in terms of their computational efficiency, defined as the
inverse of the product of their variance and the average number of
density evaluations they require. An estimator with maximum computational
efficiency is considered optimal. The optimal estimator has the
property that a minimum of computation is required to achieve a given
accuracy.

1.2 Review of Previous Work

The usual test data for estimation of the Bayes risk is a sample of
measurements or observations $X$ and their true classifications or labels
$\theta$. This type of sample will be referred to as unrestricted [31, 19],
since the statistician has no control over the label of a sample. There
are two existing forms for Bayes risk estimators: the error count
estimator and the posterior estimator. The error count estimator [19, 6]
is simply the proportion of samples $X$ whose classification by the Bayes
rule disagrees with its true classification $\theta$. The posterior
estimator was first suggested by Chow [3], later formalized by Fukunaga
and Kessel [11] and discovered independently by Lissack and Fu [27]. It is
the sample mean of the risk function evaluated at the sample points. It is
interesting that the posterior estimator ignores information on the
class labels, yet has a lower variance than the error count estimator [11].

Another sampling technique called stratified sampling is often
possible [31]. As opposed to unrestricted sampling, the statistician
chooses a priori a class label and samples observations $X$ with that label.
By choosing the number of samples per class appropriately, the variance
of a given estimator may be reduced. Neyman [31] determines the optimal
number of samples per class by minimizing the variance of the estimator.
Highleyman [19] applied the stratified sampling technique to the error
count estimator. He did not choose the optimal sample sizes, but rather
chose the number of samples per class as proportional to the prior
probability of that class. He shows that even this heuristic choice
achieves a reduction in the variance of the error count estimator.

Moore, Whitsitt and Landgrebe [30] later applied stratified sampling
to the posterior estimator. They show the heuristic sample size is not
optimal, but the optimal sample sizes are impractical since they depend
on unknown variances. Stratified sampling with sample sizes proportional
to the priors also reduces the variance of the posterior estimator.
Moore, Whitsitt and Landgrebe [30] give the interesting result that
while for unrestricted sampling, the posterior estimator has smaller
variance than the error count, this is not necessarily true when a
stratified sample is used, even with the optimal choice of sample sizes.

Both the error count and posterior estimators for the Bayes risk
require knowledge of the class conditional densities $f_m(x)$,
$m = 1, 2, \ldots, M$. When these densities are unknown, one way to proceed
is to estimate the densities and use the estimates in the estimators as if
they were the true densities. Cover and Wagner [4] call these two-step
procedures.

When the test data used for the risk estimator must also be used to
estimate the densities (i.e. when the test data is the same as the
training data), the question of data use must be considered. If the samples
used in the density estimates are also used in the risk estimator, an
optimistic bias in the resulting estimate for the Bayes risk is observed.
If the data set is large, an alternative is to partition the data and use
part to estimate the densities and the rest in the estimator. Highleyman
[19] tried to optimize this partition but Kanal and Chandrasekaran [25]
questioned his assumptions. The leave-one-out technique of Lachenbruch
and Mickey [26] attempts to remove bias by estimating the densities on
all but one sample and using the deleted sample in the estimator for the
Bayes risk. Each sample in turn is left out and the resulting risk
estimate is the average of the one-point estimates. An excellent
discussion of these and other methods of data use is given in Toussaint [37]
and Kanal [24].

Several density estimates have been considered for use in Bayes risk
estimators. Lissack and Fu [27] and Fukunaga and Kessel [12] assume a
parametric form for the densities (exponential family and Gaussian
respectively) and estimate the parameters. Fukunaga and Kessel [12],
Fralick and Scott [9], and Whitsitt and Landgrebe [39] use Parzen
estimators [32]. Fukunaga and Kessel [15] and Fukunaga and Hostetter
[13] consider nearest neighbor techniques for direct estimation of the
risk function used in the posterior estimator. Lissack and Fu [27]
apply Loftsgaarden and Quesenberry [29] nearest neighbor density
estimates to obtain estimates for the class posterior probabilities. A good
discussion of results when various combinations of estimator form, data
use and density estimates are tried is given in Whitsitt and Landgrebe
[39].

Computational difficulties in Bayes risk estimators arise from the
fact that for each sample $X$, the conditional density $f_m(X)$ of the
sample $X$ given class $m$ must be evaluated for all classes
$m = 1, 2, \ldots, M$. Whitsitt and Landgrebe [39] consider this problem
when the densities are estimated with Parzen estimators using a Gaussian
kernel. They propose an edited Parzen estimator for the densities $f_m(x)$.
Rather than averaging the kernel over all data points with class label $m$,
the average is taken over only those data points labeled $m$ which are the
$k$ nearest neighbors to the point $X$.

Any density estimate which requires nearest neighbors may be simplified
by algorithms which find nearest neighbors efficiently. These include
condensed nearest neighbor rules by Hart [18] and Swonger [36], a
branch and bound algorithm by Fukunaga and Narendra [14] and preprocessing
techniques by Fisher and Patrick [8], Yunk [40], and Friedman et al. [10].

The above techniques achieve reduction in computation by simplifying
the evaluation of the conditional densities $f_m(X)$ at the data points. In
this thesis, computationally efficient estimators are achieved by reducing
the number of densities which must be evaluated at a given sample point.
Thus rather than evaluate $f_m(X)$ for all classes $m = 1, 2, \ldots, M$ at
the point $X$, we might only evaluate $f_m(X)$ for $m$ in a subset of the
total classes. This is profitable in problems such as speech recognition
where the number of classes $M$ is large and computation of conditional
densities complex [23, 1].

1.3 Approach and Development in the Present Work

A new class of Bayes risk estimators of the general form $\hat{R}(T)$ is
proposed. The estimator $\hat{R}(T)$ is defined on the basis of sets
$T_1, \ldots, T_M$, where $T_m$ is a set of classes associated with class
$m$. Subject to mild restrictions, any choice of the sets
$T_1, \ldots, T_M$ results in an unbiased, consistent Bayes risk estimator.
Both of the existing estimators, namely the error count and the posterior,
belong to the class of estimators of the general form $\hat{R}(T)$.

In order to obtain risk estimators which require fewer class
conditional density evaluations, we restrict the set $T_m$ of classes
associated with class $m$ as follows. A scalar parameter $\alpha$ determines
the set of classes $T_m(\alpha)$ that are "$\alpha$-close" to class $m$,
that is, a sample $X$ whose true classification $\theta$ is $m$ is likely,
as determined by $\alpha$, to be classified as $i$, whenever classes $i$
and $m$ are "$\alpha$-close". As $\alpha$ varies, the sets of classes
$T_1(\alpha), \ldots, T_M(\alpha)$ vary and a family of risk estimators
$\{\hat{R}(\alpha) : 0 \le \alpha < \alpha_{\max}\}$, indexed on the
parameter $\alpha$, is achieved.

The definition of the sets $T_1(\alpha), \ldots, T_M(\alpha)$ allows the
estimator $\hat{R}(\alpha)$ to be formed with fewer class conditional
density evaluations. Thus rather than evaluate the conditional density
$f_\ell(X_j)$ at each sample $X_j$, $j = 1, 2, \ldots, N$, for each class
$\ell = 1, 2, \ldots, M$, the estimator $\hat{R}(\alpha)$ requires
evaluation of $f_\ell(X_j)$ for only those classes in a subset
$Q_{\theta_j}(\alpha)$ of the total classes $\{1, 2, \ldots, M\}$, whenever
the joint density $f_{\theta_j}(X_j)\pi_{\theta_j}$ of the sample $X_j$
and its class label $\theta_j$ is greater than $\alpha$. The subset
$Q_{\theta_j}(\alpha)$ is the set of classes that are "$\alpha$-close" to
each class that is "$\alpha$-close" to the class label $\theta_j$ of $X_j$.
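
One way to read the verbal definition of $Q_{\theta_j}(\alpha)$ is as a
union: collect every class that is close to some class that is itself close
to the label. The sketch below assumes that reading. The function name
`q_sets` and the three-class sets are illustrative only and are not taken
from the report.

```python
def q_sets(t_sets):
    """Return Q_m = union of T_i over all i in T_m, for every class m.

    Assumed reading of the text: Q_m holds every class that is close to
    some class that is close to m. `t_sets` maps a class to its set T_m.
    """
    return {m: set().union(*(t_sets[i] for i in t_m))
            for m, t_m in t_sets.items()}


# Hypothetical three-class example: class 1 is confusable with nothing else,
# while classes 2 and 3 are confusable with each other.
T = {1: {1}, 2: {2, 3}, 3: {2, 3}}
Q = q_sets(T)

# Number of densities needed for a sample of each label under this reading.
for label in sorted(Q):
    print(label, sorted(Q[label]), len(Q[label]))
```

Under these assumed sets a sample labeled 1 needs a single density
evaluation instead of three, which is the kind of saving the family of
estimators is meant to exploit.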

The amount of computation required by the estimator $\hat{R}(\alpha)$ is
expressed by $N \times C(\alpha)$, the average number of conditional
densities that must be evaluated, where $N$ is the sample size and
$C(\alpha)$ is the average number of conditional densities per sample used
in forming $\hat{R}(\alpha)$. The error count and posterior estimators
require all $M$ conditional densities per sample, a total of $M \times N$
density evaluations. Thus if $\alpha$ is such that $C(\alpha)$ is much
smaller than $M$, the estimator $\hat{R}(\alpha)$ would be computationally
preferable to either of the existing estimators.

The accuracy of an estimator $\hat{R}(\alpha)$ based on $N$ samples is
given by its variance $V(\alpha)/N$. Thus the larger the sample size $N$,
or the smaller the coefficient of variance $V(\alpha)$, the more accurate
the estimator. The estimator in the family
$\{\hat{R}(\alpha) : 0 \le \alpha < \alpha_{\max}\}$ with the smallest
coefficient of variance $V(\alpha)$ has the property that it requires the
least number of samples $N$ to achieve a given accuracy. However, the size
of the sample is not sufficient to characterize the amount of computation
required by an estimator in the family
$\{\hat{R}(\alpha) : 0 \le \alpha < \alpha_{\max}\}$, since the average
number of density evaluations per sample $C(\alpha)$ required by each
estimator must be considered.

We define the computational efficiency $CE(\alpha)$ of an estimator
$\hat{R}(\alpha)$ as the inverse of the product of its variance and the
average number of density evaluations it requires, thus
$CE(\alpha) = 1/(V(\alpha) \times C(\alpha))$. The optimal estimator
$\hat{R}(\alpha^*)$ from the family
$\{\hat{R}(\alpha) : 0 \le \alpha < \alpha_{\max}\}$ is determined by
choosing $\alpha^*$ to maximize the computational efficiency $CE(\alpha)$.
The optimal estimator $\hat{R}(\alpha^*)$ has the property that it achieves
a given accuracy with a minimum of computation [16, 17], or symmetrically,
that for a given amount of computation, $\hat{R}(\alpha^*)$ is the most
accurate estimator for the Bayes risk $R$.
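
The variance of an estimator is $V(\alpha)/N$ and its total cost is
$N \times C(\alpha)$, so their product $V(\alpha) \times C(\alpha)$ does not
depend on $N$; maximizing $CE(\alpha)$ is the same as minimizing that
product. The following sketch illustrates the selection over a finite list
of candidates. The function names and every number in `CANDIDATES` are
hypothetical and chosen only for illustration.

```python
def computational_efficiency(v, c):
    """CE = 1 / (V * C): V is the coefficient of variance, C the average
    number of density evaluations per sample."""
    return 1.0 / (v * c)


def best_alpha(candidates):
    """Pick the alpha with maximum CE from (alpha, V, C) triples."""
    return max(candidates,
               key=lambda t: computational_efficiency(t[1], t[2]))[0]


def variance_for_budget(v, c, budget):
    """Variance V/N reached when N = budget / C samples are affordable."""
    return v * c / budget


# Hypothetical (alpha, V(alpha), C(alpha)) triples for a 10-class problem.
# The first row mimics an estimator that needs all M = 10 densities.
CANDIDATES = [(0.0, 0.12, 10.0), (0.01, 0.13, 4.0), (0.05, 0.20, 3.5)]

alpha_star = best_alpha(CANDIDATES)
for alpha, v, c in CANDIDATES:
    print(alpha, computational_efficiency(v, c),
          variance_for_budget(v, c, 1000.0))
```

In this made-up table the second candidate has a slightly larger
coefficient of variance than the first, yet it reaches a lower variance for
the same budget of density evaluations because each sample is cheaper.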

In  practice,  the  optimal  estimator  could  not  be  determined  in  this 
way  since  this  would  in  effect  require  knowledge  of  the  true  risk  R. 

Thus  a technique  is  proposed  whereby  a subset  n of  the  total  N samples 
is  used  to  approximate  the  optimal  estimator.  The  number  n of  samples 
should  contain  enough  information  on  the  closeness  of  the  classes  to 


determine  an  almost  optimal  estimator.  The  remaining  N-n  samples  are 


CHAPTER  2 


BASIC  CONCEPTS  ASSOCIATED  WITH  THE  BAYES  RISK 

A general pattern recognition system may be modeled mathematically
in terms of a probability triple $(\Omega, F, P)$, an observation random
variable $X$, and a labeling random variable $\theta$. Let $\Omega$ be the
space of patterns $\omega$, $F$ a sigma field of subsets of $\Omega$ and
$P$ a probability measure defined on $F$. The patterns $\omega \in \Omega$
are to be classified into one of $M$ classes, where the classes
$\Omega_1, \ldots, \Omega_M$ are a disjoint partition of $\Omega$. If
a pattern $\omega \in \Omega_m$ we say $\omega$ is in class $m$.

The random variable $\theta : \Omega \to \{1, 2, \ldots, M\}$ specifies the
class of a pattern $\omega$, so that $\theta(\omega) = m$ whenever
$\omega \in \Omega_m$. $\theta$ is referred to as the class label or simply
the label of a pattern. The prior probability of the $m$-th class is given
by

$$\pi_m = P[\Omega_m] = P[\theta = m]. \tag{2-1}$$

In practice, the patterns $\omega \in \Omega$ are not actually observed.
Rather, one observes a set of measurements made on $\omega$. The random
variable $X : \Omega \to S$, where $S$ is a subset of a real Euclidean
space, specifies the measurements $X(\omega) = x \in S$ made on a pattern
$\omega$. Assume the conditional density of $X$ given $\theta = m$ exists
and is continuous and denote it by $f_m(x)$. Then the unconditional density
of $X$, or the mixture density, is given by

$$f(x) = \sum_{\ell=1}^{M} \pi_\ell f_\ell(x). \tag{2-2}$$

Also, the posterior probability of class $m$, the probability that
$\theta = m$ given the observation $X(\omega) = x$, is

$$p_m(x) = \frac{f_m(x)\,\pi_m}{f(x)}. \tag{2-3}$$
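
Equations (2-2) and (2-3) can be evaluated directly once the densities and
priors are fixed. The sketch below does this for a toy problem; the
three unit-variance Gaussian classes, the priors, and the function names are
assumptions made for illustration and do not appear in the report.

```python
import math

# Hypothetical three-class problem: unit-variance Gaussian class densities.
MEANS = [-2.0, 0.0, 2.0]
PRIORS = [0.25, 0.5, 0.25]


def class_density(m, x):
    """f_m(x): Gaussian density with mean MEANS[m] and unit variance."""
    return math.exp(-0.5 * (x - MEANS[m]) ** 2) / math.sqrt(2.0 * math.pi)


def mixture_density(x):
    """f(x) = sum over classes of pi_l * f_l(x), as in (2-2)."""
    return sum(PRIORS[l] * class_density(l, x) for l in range(len(MEANS)))


def posteriors(x):
    """p_m(x) = f_m(x) * pi_m / f(x) for every class m, as in (2-3)."""
    fx = mixture_density(x)
    return [class_density(m, x) * PRIORS[m] / fx for m in range(len(MEANS))]


print(posteriors(0.3))
```

Note that one call to `posteriors` costs one density evaluation per class
plus the mixture, which is the per-sample cost the report sets out to
reduce.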


On the basis of the observation $X(\omega) = x$, the recognition system
tries to decide the true classification of the pattern $\omega$, i.e. the
value of $\theta(\omega)$. This decision may be specified by a behavioral
decision rule $\delta(x) = (\delta_1(x), \delta_2(x), \ldots, \delta_M(x))$,
where $\delta_m(x)$ is the probability that the recognition system
classifies a pattern $\omega$ as belonging to class $m$, given the
observation $x$. Thus $\delta_m(x) \ge 0$, $m = 1, 2, \ldots, M$ and

$$\sum_{m=1}^{M} \delta_m(x) = 1.$$


Given a decision rule $\delta$, the probability $R(\delta)$ that the system
makes a classification error may be written

$$R(\delta) = \sum_{m=1}^{M} \pi_m \int_S \sum_{\ell \ne m} \delta_\ell(x) f_m(x)\,dx
            = \sum_{m=1}^{M} \pi_m \int_S \bigl(1 - \delta_m(x)\bigr) f_m(x)\,dx. \tag{2-4}$$

It is well known [2, 7] that the decision rule $\delta$ which minimizes the
probability of classification error $R(\delta)$ is the Bayes decision rule
$\delta^*$, where ties are broken at random and

$$\delta^*_m(x) = \begin{cases}
1 & \text{if } f_m(x)\pi_m > f_\ell(x)\pi_\ell \ \ \forall\, \ell = 1, 2, \ldots, M,\ \ell \ne m \\
0 & \text{if } \exists\, k \ne m \text{ such that } f_k(x)\pi_k > f_m(x)\pi_m
\end{cases} \tag{2-5}$$

The minimum probability of classification error resulting from Bayes
decision rule is called Bayes risk and is denoted by $R$.


The error function $\varepsilon_\theta(X)$ is defined for $\theta = m$,
$X = x$ as one minus the Bayes rule $\delta^*_m(x)$,

$$\varepsilon_m(x) = \begin{cases}
0 & \text{if } f_m(x)\pi_m > f_\ell(x)\pi_\ell \ \ \forall\, \ell = 1, 2, \ldots, M,\ \ell \ne m \\
1 & \text{if } \exists\, k \ne m \text{ such that } f_k(x)\pi_k > f_m(x)\pi_m
\end{cases} \tag{2-6}$$

Then the Bayes risk $R$ is

$$R = \sum_{m=1}^{M} \pi_m \int_S \varepsilon_m(x) f_m(x)\,dx. \tag{2-7}$$
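
As a numerical check of (2-5) through (2-7), the sketch below evaluates the
Bayes risk of a toy problem by a midpoint-rule integral and compares it with
the closed form that happens to be available for that problem. The
two-class Gaussian setup, the integration range, and the function names are
illustrative assumptions. Ties are resolved toward the lowest class index
here rather than at random, which does not change the integral.

```python
import math

# Hypothetical two-class problem: unit-variance Gaussians, equal priors.
MEANS = [-1.0, 1.0]
PRIORS = [0.5, 0.5]


def class_density(m, x):
    return math.exp(-0.5 * (x - MEANS[m]) ** 2) / math.sqrt(2.0 * math.pi)


def bayes_class(x):
    """Class with the largest joint density f_m(x) * pi_m (Bayes rule)."""
    scores = [PRIORS[m] * class_density(m, x) for m in range(len(MEANS))]
    return scores.index(max(scores))


def error_function(m, x):
    """epsilon_m(x): 0 if the Bayes rule picks m at x, else 1."""
    return 0 if bayes_class(x) == m else 1


def bayes_risk(lo=-10.0, hi=10.0, steps=20000):
    """Midpoint-rule approximation of equation (2-7)."""
    h = (hi - lo) / steps
    total = 0.0
    for k in range(steps):
        x = lo + (k + 0.5) * h
        for m in range(len(MEANS)):
            total += PRIORS[m] * error_function(m, x) * class_density(m, x) * h
    return total


numeric = bayes_risk()
# For this toy problem the decision boundary is x = 0, so R = Phi(-1).
closed_form = 0.5 * math.erfc(1.0 / math.sqrt(2.0))
print(numeric, closed_form)
```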


Note that $R$ is just the expectation, over the random variables $X$ and
$\theta$, of the error function $\varepsilon_\theta(X)$, so

$$R = E\{\varepsilon_\theta(X)\}. \tag{2-8}$$

The conditional risk $R_m$ is the probability of classification
error given class $m$,

$$R_m = \int_S \varepsilon_m(x) f_m(x)\,dx. \tag{2-9}$$

Thus $R_m$ is the conditional expectation of the error function
$\varepsilon_\theta(X)$ given $\theta = m$,

$$R_m = E\{\varepsilon_\theta(X) \mid \theta = m\}. \tag{2-10}$$

Since

$$R = E\{\varepsilon_\theta(X)\} = E\{E\{\varepsilon_\theta(X) \mid \theta\}\} \tag{2-11}$$

we have that

$$R = \sum_{m=1}^{M} \pi_m E\{\varepsilon_\theta(X) \mid \theta = m\}
    = \sum_{m=1}^{M} \pi_m R_m. \tag{2-12}$$

The risk function $r(x)$ is the probability of classification error
given the observation $X = x$. Symbolically,

CHAPTER 3

PROPOSED NEW ESTIMATORS FOR THE BAYES RISK

3.1 Introduction

In this section, a general form $\hat{R}(T)$ for Bayes risk estimators
is defined. Based on the general form, a family of estimators
$\{\hat{R}(\alpha) : 0 \le \alpha < \alpha_{\max}\}$, indexed on a scalar
parameter $\alpha$, is derived. A computational form for estimators in the
family is given which in general allows these estimators to be computed on
the basis of fewer density evaluations per sample. The computational
requirements of an estimator $\hat{R}(\alpha)$ may be described by the
expected number of density evaluations $C(\alpha)$ per sample. The behavior
of $C(\alpha)$ as a function of $\alpha$, as well as the behavior of the
variance $V(\alpha)$ of $\hat{R}(\alpha)$, is discussed.

Two  sampling  techniques  for  estimation  of  the  Bayes  risk  are 
considered:  unrestricted  sampling  and  stratified  sampling.  The  basic 
difference  between  these  sampling  techniques  is  that  in  unrestricted 
sampling,  the  number  of  samples  with  a given  class  label  is  random, 
while  for  stratified  sampling  the  statistician  chooses  a priori  the 
number  of  samples  with  a given  class  label. 
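
The difference between the two sampling schemes can be sketched in a few
lines: under unrestricted sampling the labels are drawn from the priors,
under stratified sampling the class counts are fixed before sampling. The
priors, the sample size, and the proportional allocation below are
hypothetical choices for illustration.

```python
import random
from collections import Counter

PRIORS = {1: 0.5, 2: 0.3, 3: 0.2}  # hypothetical class priors


def unrestricted_labels(n, rng):
    """Labels drawn i.i.d. from the priors: class counts are random."""
    classes = list(PRIORS)
    weights = [PRIORS[m] for m in classes]
    return rng.choices(classes, weights=weights, k=n)


def stratified_labels(counts):
    """Labels fixed a priori: counts[m] samples are taken from class m."""
    return [m for m, n_m in counts.items() for _ in range(n_m)]


rng = random.Random(0)
N = 100
unrestricted = Counter(unrestricted_labels(N, rng))

# Allocation proportional to the priors, one possible a priori choice.
proportional = {m: round(p * N) for m, p in PRIORS.items()}
stratified = Counter(stratified_labels(proportional))

print(dict(unrestricted))
print(dict(stratified))
```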

3.2 Estimators Based on Unrestricted Sampling

For unrestricted sampling, the data is a sequence
$\{(X_1, \theta_1), (X_2, \theta_2), \ldots, (X_N, \theta_N)\}$ of $N$
independent random vectors identically distributed as $(X, \theta)$. The
joint density of $(X, \theta)$ at $X = x$, $\theta = m$ is given by
$f_m(x)\pi_m$, where $f_m(x)$ is the conditional density of $X$ given the
class label $\theta = m$ and $\pi_m$ is the prior probability of class $m$.
The marginal density of the observation $X$ at $X = x$ is $f(x)$, the
mixture density. The proportion of samples whose class label $\theta_j$ is
$m$ is random, with mean $\pi_m$.

3.2.1 Remarks on Error Count and Posterior Estimators

The error count estimator $\hat{R}(ec)$ for the Bayes risk $R$ is formed
by counting the proportion of samples whose classification by the
Bayes rule disagrees with the true class label $\theta_j$. Symbolically,

$$\hat{R}(ec) = \frac{1}{N} \sum_{j=1}^{N} \varepsilon_{\theta_j}(X_j). \tag{3-1}$$


The error count estimator is unbiased, since

$$E\{\hat{R}(ec)\} = E\{\varepsilon_\theta(X)\} = R. \tag{3-2}$$

It is also consistent [19, 6], since

$$\mathrm{VAR}\{\hat{R}(ec)\} = \frac{1}{N}\,\mathrm{VAR}\{\varepsilon_\theta(X)\}. \tag{3-3}$$
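
A minimal sketch of the error count estimator (3-1) on simulated
unrestricted samples follows. The two-class Gaussian problem, the seed, and
the sample size are assumptions made for illustration; every sample costs
$M = 2$ density evaluations here.

```python
import math
import random

# Hypothetical two-class problem: unit-variance Gaussians, equal priors.
MEANS = [-1.0, 1.0]
PRIORS = [0.5, 0.5]


def class_density(m, x):
    return math.exp(-0.5 * (x - MEANS[m]) ** 2) / math.sqrt(2.0 * math.pi)


def bayes_class(x):
    """Class with the largest joint density f_m(x) * pi_m."""
    scores = [PRIORS[m] * class_density(m, x) for m in range(len(MEANS))]
    return scores.index(max(scores))


def draw_sample(rng):
    """One unrestricted sample (X, theta): label from priors, then X."""
    theta = rng.choices(range(len(MEANS)), weights=PRIORS)[0]
    return rng.gauss(MEANS[theta], 1.0), theta


def error_count_estimate(samples):
    """Proportion of samples the Bayes rule misclassifies, as in (3-1)."""
    errors = sum(1 for x, theta in samples if bayes_class(x) != theta)
    return errors / len(samples)


rng = random.Random(0)
samples = [draw_sample(rng) for _ in range(20000)]
estimate = error_count_estimate(samples)
# Closed-form Bayes risk for this toy problem: Phi(-1).
true_risk = 0.5 * math.erfc(1.0 / math.sqrt(2.0))
print(estimate, true_risk)
```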


The error count estimator $\hat{R}_m(ec)$ for the conditional risk $R_m$
given class $m$ is

$$\hat{R}_m(ec) = \frac{1}{N} \sum_{j=1}^{N} I_m(\theta_j)\,\frac{\varepsilon_m(X_j)}{\pi_m} \tag{3-4}$$

where

$$I_m(\theta) = \begin{cases} 1, & \theta = m \\ 0, & \theta \ne m. \end{cases}$$

$\hat{R}_m(ec)$ is an unbiased estimator for $R_m$ since [33]

$$E\{\hat{R}_m(ec)\} = \frac{E\{I_m(\theta)\,\varepsilon_m(X)\}}{\pi_m}
                     = E\{\varepsilon_\theta(X) \mid \theta = m\} = R_m. \tag{3-5}$$

Note that the error count estimator $\hat{R}_m(ec)$ considers only
classification errors made on samples whose class labels $\theta_j = m$.
Also,

$$\hat{R}(ec) = \sum_{m=1}^{M} \pi_m \hat{R}_m(ec). \tag{3-6}$$


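The posterior estimator $\hat{R}(p)$ for the Bayes risk $R$ is the sample
mean of the risk function $r(X_j)$ over the samples $j = 1, 2, \ldots, N$
[11]. Thus

$$\hat{R}(p) = \frac{1}{N} \sum_{j=1}^{N} r(X_j)
             = \frac{1}{N} \sum_{j=1}^{N} \sum_{m=1}^{M} \varepsilon_m(X_j)\, p_m(X_j). \tag{3-7}$$

The posterior estimator is unbiased, since

$$E\{\hat{R}(p)\} = E\{r(X)\} = R. \tag{3-8}$$

It is also consistent, since

$$\mathrm{VAR}\{\hat{R}(p)\} = \frac{1}{N}\,\mathrm{VAR}\{r(X)\}
                             = \frac{1}{N}\Bigl[\int_S r^2(x) f(x)\,dx - R^2\Bigr]. \tag{3-9}$$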


It has been shown [11] that the posterior estimator has smaller
variance than the error count estimator. This follows from the fact
that since $0 \le r(x) \le 1 - \frac{1}{M}$,

$$\int_S r^2(x) f(x)\,dx \le R - \frac{R}{M} \tag{3-10}$$

and thus

$$\mathrm{VAR}\{\hat{R}(p)\} \le \frac{1}{N} R(1 - R) = \mathrm{VAR}\{\hat{R}(ec)\}. \tag{3-11}$$
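
The variance ordering can be observed empirically by repeating both
estimators over many independent test sets. The sketch below does so for a
toy problem; the two-class Gaussian setup, the seed, and the trial counts
are illustrative assumptions. The risk function is computed as one minus
the largest posterior, which equals the sum over $m$ of
$\varepsilon_m(x)\,p_m(x)$ when there are no ties.

```python
import math
import random
import statistics

# Hypothetical two-class problem: unit-variance Gaussians, equal priors.
MEANS = [-1.0, 1.0]
PRIORS = [0.5, 0.5]


def joint_densities(x):
    """f_m(x) * pi_m for every class m."""
    return [PRIORS[m] * math.exp(-0.5 * (x - MEANS[m]) ** 2)
            / math.sqrt(2.0 * math.pi) for m in range(len(MEANS))]


def bayes_class(x):
    joint = joint_densities(x)
    return joint.index(max(joint))


def risk_function(x):
    """r(x) = 1 - max_m p_m(x), the Bayes error probability at x."""
    joint = joint_densities(x)
    return 1.0 - max(joint) / sum(joint)


def draw_sample(rng):
    theta = rng.choices(range(len(MEANS)), weights=PRIORS)[0]
    return rng.gauss(MEANS[theta], 1.0), theta


def one_trial(rng, n):
    """Return (error count estimate, posterior estimate) on n samples."""
    samples = [draw_sample(rng) for _ in range(n)]
    ec = sum(1 for x, theta in samples if bayes_class(x) != theta) / n
    post = sum(risk_function(x) for x, _ in samples) / n
    return ec, post


rng = random.Random(1)
trials = [one_trial(rng, 200) for _ in range(300)]
var_ec = statistics.variance([ec for ec, _ in trials])
var_post = statistics.variance([post for _, post in trials])
mean_post = statistics.mean([post for _, post in trials])
print(var_ec, var_post, mean_post)
```

The posterior estimate never looks at the labels, yet in this simulation
its spread across trials is smaller than that of the error count.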


The posterior estimator $\hat{R}_m(p)$ for the conditional risk $R_m$ is
defined by

$$\hat{R}_m(p) = \frac{1}{N} \sum_{j=1}^{N} \varepsilon_m(X_j)\,\frac{f_m(X_j)}{f(X_j)}. \tag{3-12}$$


In  contrast  to  the  error  count  estimator,  the  posterior  estimator  R (p) 

m 

considers  errors  made  on  all  samples  X ^ , regardless  of  their  class  labels 

0..  In  fact,  the  posterior  estimator  makes  no  use  of  the  class  labels. 

J 

The  expectation  of  Rm(p)  is  thus  taken  with  respect  to  the  mixture 


density  E(x) , so 


f„<*> 


= J*  C (*)  ff-'T  f(x)dx  = J em(X)fm(X^dX  = Rm  (3-13) 

n rn  jl  \ x j q in  m in 


Thus  R (p)  is  an  unbiased  estimator  of  the  conditional  risk  R . Again 
m r m 


(3-14) 


R(p)  = T.  n R (p)  . 
r m=l  mm 
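A minimal sketch of the posterior estimator (3-7) follows. It assumes unit-variance Gaussian class conditional densities and takes $\epsilon_m(x) = 1$ whenever the Bayes rule decides a class other than $m$, so that $r(x) = 1 - \max_m p_m(x)$; the function names are illustrative only.

```python
import math

def gaussian_pdf(x, mean, std=1.0):
    """Density of N(mean, std^2) at x (assumed form of f_m)."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def risk_function(x, means, priors):
    """r(x) = sum_m eps_m(x) p_m(x), i.e. 1 minus the largest posterior."""
    joint = [gaussian_pdf(x, mu) * p for mu, p in zip(means, priors)]
    return 1.0 - max(joint) / sum(joint)

def posterior_estimate(samples, means, priors):
    """Equation (3-7): sample mean of r(X_j); class labels are never used."""
    return sum(risk_function(x, means, priors) for x in samples) / len(samples)

# Hypothetical two-class data: means 0 and 2, equal priors.
means, priors = [0.0, 2.0], [0.5, 0.5]
samples = [-1.0, 0.5, 1.5, 2.5, 3.0, 0.2]
r_post = posterior_estimate(samples, means, priors)
r_boundary = risk_function(1.0, means, priors)   # both posteriors equal here
```

At the decision boundary `x = 1.0` the risk function reaches its upper bound $1 - 1/M = 0.5$ for $M = 2$, consistent with the bound used in (3-10).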


3.2.2  A General  Form  for  Bayes  Risk  Estimators 


The error count estimator for the conditional risk $R_m$ in effect considers only those samples $X_j$ whose class labels $\theta_j$ are equal to $m$. The posterior estimator for $R_m$ considers all samples, regardless of their true classification. This concept may be generalized by associating with each class $m$ a subset $T_m$ of the total classes, and forming an estimator $\hat{R}_m(T)$ based on those samples $X_j$ whose class labels $\theta_j$ are elements of $T_m$.

Specifically, for each $m=1,2\ldots M$, let $T_m = \{i_1, i_2, \ldots, i_{p_m}\}$ be a set of $p_m$ classes associated with class $m$, where $i_j$, $j=1,2\ldots p_m$ are members of the set. The sets $T_m$, $m=1,2\ldots M$ may be chosen arbitrarily, subject to the following restrictions.

(r1)  $m \in T_m \quad \forall\, m=1,2\ldots M$

(r2)  $i \in T_m$ iff $m \in T_i \quad \forall\, i,m=1,2\ldots M$

Restriction (r1) requires that each class be associated with itself, and restriction (r2) requires that a class $i$ be associated with class $m$ if and only if class $m$ is associated with class $i$.

For $M$ classes, there are $2^{M(M-1)/2}$ different ways to choose the sets $T_m$, $m=1,2\ldots M$, subject to restrictions (r1) and (r2). Table 3-1 lists the 8 choices for the case of $M = 3$ classes.

TABLE 3-1

    T_1       T_2       T_3       Q_1       Q_2       Q_3
    {1,2,3}   {1,2,3}   {1,2,3}   {1,2,3}   {1,2,3}   {1,2,3}
    {1,3}     {2}       {1,3}     {1,3}     {2}       {1,3}
    {1,3}     {2,3}     {1,2,3}   {1,2,3}   {1,2,3}   {1,2,3}
    {1}       {2,3}     {2,3}     {1}       {2,3}     {2,3}
    {1,2,3}   {1,2}     {1,3}     {1,2,3}   {1,2,3}   {1,2,3}
    {1,2}     {1,2}     {3}       {1,2}     {1,2}     {3}
    {1,2}     {1,2,3}   {2,3}     {1,2,3}   {1,2,3}   {1,2,3}
    {1}       {2}       {3}       {1}       {2}       {3}

All possible choices of the sets $T_m$, $m=1,2,3$ subject to restrictions (r1) and (r2) and the resulting sets $Q_m$, $m=1,2,3$.
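The count $2^{M(M-1)/2}$ can be checked by brute force: under (r1) and (r2) each unordered pair of distinct classes is either mutually associated or not, which gives one binary choice per pair. The sketch below enumerates the choices; the function name is hypothetical and classes are numbered 1 to M to match the table.

```python
from itertools import combinations, product

def enumerate_T_choices(M):
    """All families (T_1, ..., T_M) that satisfy restrictions (r1) and (r2)."""
    pairs = list(combinations(range(1, M + 1), 2))
    choices = []
    for bits in product([False, True], repeat=len(pairs)):
        T = {m: {m} for m in range(1, M + 1)}       # (r1): m is in T_m
        for (i, m), linked in zip(pairs, bits):
            if linked:                               # (r2): association is symmetric
                T[i].add(m)
                T[m].add(i)
        choices.append(T)
    return choices

choices_3 = enumerate_T_choices(3)
n_choices_3 = len(choices_3)
n_choices_4 = len(enumerate_T_choices(4))
```

For `M = 3` this yields the 8 rows of Table 3-1, and for `M = 4` it yields 64 families, so the number of candidate estimators grows quickly with the number of classes.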

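The general form $\hat{R}_m(T)$ for an estimator of the conditional risk $R_m$ is defined by considering those samples $X_j$ whose class labels $\theta_j$ are in $T_m$. Thus

$$\hat{R}_m(T) = \frac{1}{N} \sum_{j=1}^{N} I_{T_m}(\theta_j)\,\frac{\epsilon_m(X_j) f_m(X_j)}{\sum_{\ell \in T_m} f_\ell(X_j)\pi_\ell} \tag{3-15}$$

where

$$I_{T_m}(\theta_j) = \begin{cases} 1, & \theta_j \in T_m \\ 0, & \theta_j \notin T_m. \end{cases}$$

Subject to the restriction (r1), $\hat{R}_m(T)$ is an unbiased estimator for $R_m$ for any choice of the set $T_m$, since

$$\begin{aligned}
E\{\hat{R}_m(T)\} &= E\left\{ I_{T_m}(\theta)\,\frac{\epsilon_m(X) f_m(X)}{\sum_{\ell \in T_m} f_\ell(X)\pi_\ell} \right\} \\
&= \sum_{i=1}^{M} \pi_i \int_S I_{T_m}(i)\,\frac{\epsilon_m(x) f_m(x)}{\sum_{\ell \in T_m} f_\ell(x)\pi_\ell}\, f_i(x)\,dx \\
&= \int_S \epsilon_m(x) f_m(x)\,\frac{\sum_{i \in T_m} f_i(x)\pi_i}{\sum_{\ell \in T_m} f_\ell(x)\pi_\ell}\,dx \\
&= \int_S \epsilon_m(x) f_m(x)\,dx = R_m.
\end{aligned} \tag{3-16}$$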


A general estimator $\hat{R}(T)$ for the unconditional risk is formed

$$\hat{R}(T) = \sum_{m=1}^{M} \pi_m \hat{R}_m(T) = \frac{1}{N} \sum_{j=1}^{N} \sum_{m=1}^{M} I_{T_m}(\theta_j)\,\frac{\epsilon_m(X_j) f_m(X_j)\pi_m}{\sum_{\ell \in T_m} f_\ell(X_j)\pi_\ell} \tag{3-17}$$

By linearity of the expectation operator, $\hat{R}(T)$ is unbiased for any choice of the sets $T_m$, $m=1,2\ldots M$ (subject to (r1)). By restriction (r2), $I_{T_m}(\theta) = I_{T_\theta}(m)$, thus from (3-17), $\hat{R}(T)$ may be written

$$\hat{R}(T) = \frac{1}{N} \sum_{j=1}^{N} \sum_{m \in T_{\theta_j}} \epsilon_m(X_j)\,\frac{f_m(X_j)\pi_m}{\sum_{\ell \in T_m} f_\ell(X_j)\pi_\ell} \tag{3-18}$$


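The general form $\hat{R}(T)$ is also consistent for any restricted choice of the sets $T_m$, $m=1,2\ldots M$. This follows because

$$\begin{aligned}
\mathrm{VAR}\{\hat{R}(T)\} &= \frac{1}{N}\,\mathrm{VAR}\left\{ \sum_{m \in T_\theta} \epsilon_m(X)\,\frac{f_m(X)\pi_m}{\sum_{\ell \in T_m} f_\ell(X)\pi_\ell} \right\} \\
&= \frac{1}{N} \left[ \sum_{i=1}^{M} \pi_i \int_S \left( \sum_{m \in T_i} \frac{\epsilon_m(x) f_m(x)\pi_m}{\sum_{\ell \in T_m} f_\ell(x)\pi_\ell} \right)^{2} f_i(x)\,dx - R^2 \right]
\end{aligned} \tag{3-19}$$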


Note that when each class $m$ is associated only with itself, that is when $T_m = \{m\}$ $\forall\, m=1,2\ldots M$, the general estimator $\hat{R}(T)$ is just the error count estimator $\hat{R}(ec)$, since in this case

$$\hat{R}(T) = \frac{1}{N} \sum_{j=1}^{N} \sum_{m=1}^{M} I_{\{m\}}(\theta_j)\,\epsilon_m(X_j) = \frac{1}{N} \sum_{j=1}^{N} \epsilon_{\theta_j}(X_j) = \hat{R}(ec) \tag{3-20}$$

When each class $m$ is associated with all other classes, that is when $T_m = \{1,2\ldots M\}$ $\forall\, m=1,2\ldots M$, then the posterior estimator $\hat{R}(p)$ is obtained, since

$$\hat{R}(T) = \frac{1}{N} \sum_{j=1}^{N} \sum_{m=1}^{M} I_{T_m}(\theta_j)\,\frac{\epsilon_m(X_j) f_m(X_j)\pi_m}{\sum_{\ell=1}^{M} f_\ell(X_j)\pi_\ell} = \frac{1}{N} \sum_{j=1}^{N} \sum_{m=1}^{M} \frac{\epsilon_m(X_j) f_m(X_j)\pi_m}{f(X_j)} = \hat{R}(p) \tag{3-21}$$
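The general form (3-18) and its two special cases can be sketched as follows. The Gaussian densities, the data, the 0-based class numbering and the name `general_estimate` are assumptions made for illustration; the error function is taken to be 1 whenever the Bayes rule decides a class other than $m$.

```python
import math

def gaussian_pdf(x, mean, std=1.0):
    """Density of N(mean, std^2) at x (assumed form of f_m)."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def general_estimate(samples, labels, means, priors, T):
    """Equation (3-18). T[m] is the set of classes associated with class m."""
    total = 0.0
    for x, theta in zip(samples, labels):
        joint = [gaussian_pdf(x, mu) * p for mu, p in zip(means, priors)]
        decided = max(range(len(joint)), key=joint.__getitem__)
        for m in T[theta]:
            eps_m = 0.0 if m == decided else 1.0        # error function eps_m(x)
            total += eps_m * joint[m] / sum(joint[l] for l in T[m])
    return total / len(samples)

# Hypothetical three-class problem.
means, priors = [0.0, 1.0, 5.0], [0.3, 0.3, 0.4]
samples = [-0.5, 0.8, 0.4, 1.7, 4.6, 5.2]
labels = [0, 0, 1, 1, 2, 2]
M = len(means)

T_ec = [{m} for m in range(M)]                # T_m = {m}: error count, (3-20)
T_post = [set(range(M)) for _ in range(M)]    # T_m = all classes: posterior, (3-21)
T_mixed = [{0, 1}, {0, 1}, {2}]               # one of the new estimators for M > 2

r_ec = general_estimate(samples, labels, means, priors, T_ec)
r_post = general_estimate(samples, labels, means, priors, T_post)
r_mixed = general_estimate(samples, labels, means, priors, T_mixed)
```

This sketch evaluates every density at every sample for simplicity, so it shows only the form of the estimator, not the computational savings discussed later in the section.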


The number of different estimators for the Bayes risk specified by the general form $\hat{R}(T)$ in (3-18) is $2^{M(M-1)/2}$, the number of different ways to choose the sets $T_m$, $m=1,2\ldots M$, subject to the restrictions (r1) and (r2). If the number of classes $M = 2$, only two estimators may be obtained, namely the error count and posterior. For $M > 2$, the general form $\hat{R}(T)$ specifies several new estimators, depending on how the sets $T_m$, $m=1,2\ldots M$ are chosen. Let us first consider how to choose the sets $T_m$, $m=1,2\ldots M$ so that an estimator $\hat{R}(T)$ with minimum variance is achieved.

It was shown in section 3.2.1 that the posterior estimator $\hat{R}(p)$ has smaller variance than the error count estimator $\hat{R}(ec)$. The following theorem generalizes this result by showing that the choice of the sets $T_m$, $m=1,2\ldots M$ which minimizes the variance of the estimator $\hat{R}(T)$ is $T_m^* = \{1,2\ldots M\}$ $\forall\, m=1,2\ldots M$. But $\hat{R}(T^*)$ is just the posterior estimator $\hat{R}(p)$, thus the posterior estimator has the smallest variance of any estimator of the general form $\hat{R}(T)$.

Theorem 3-1

Let $\hat{R}(T)$ be the general estimator for the Bayes risk given by equation (3-18), with the sets $T_m$, $m=1,2\ldots M$ chosen arbitrarily. Let $\hat{R}(T^*)$ be the general estimator with the sets chosen as

$$T_m^* = \{1,2\ldots M\} \quad \forall\, m=1,2\ldots M.$$

Then $\mathrm{VAR}\{\hat{R}(T^*)\} \le \mathrm{VAR}\{\hat{R}(T)\}$.

Proof

Let

$$r_T(X,\theta) = \sum_{m \in T_\theta} \epsilon_m(X)\,\frac{f_m(X)\pi_m}{\sum_{\ell \in T_m} f_\ell(X)\pi_\ell} \tag{3-22}$$

Then from equation (3-19)

$$\mathrm{VAR}\{\hat{R}(T)\} = \frac{1}{N}\,\mathrm{VAR}\{r_T(X,\theta)\} \tag{3-23}$$

The conditional expectation of $r_T(X,\theta)$, given $X$, is $r(X)$, the risk function. To see this,

$$E\{r_T(X,\theta) \mid X\} = \sum_{i=1}^{M} r_T(X,i)\,p_i(X) = \sum_{i=1}^{M} \sum_{m \in T_i} \epsilon_m(X)\,\frac{f_m(X)\pi_m}{\sum_{\ell \in T_m} f_\ell(X)\pi_\ell}\,p_i(X). \tag{3-24}$$

Now $p_i(X) = \frac{f_i(X)\pi_i}{f(X)}$ from equation (2-3), and by restriction (r2), $\sum_{i=1}^{M} \sum_{m \in T_i} = \sum_{m=1}^{M} \sum_{i \in T_m}$. Thus from (3-24),

$$E\{r_T(X,\theta) \mid X\} = \sum_{m=1}^{M} \sum_{i \in T_m} \frac{\epsilon_m(X) f_m(X)\pi_m}{\sum_{\ell \in T_m} f_\ell(X)\pi_\ell}\,\frac{f_i(X)\pi_i}{f(X)} = \sum_{m=1}^{M} \epsilon_m(X)\,\frac{f_m(X)\pi_m}{f(X)} = r(X). \tag{3-25}$$

Also since $\hat{R}(T^*) = \hat{R}(p)$,

$$\mathrm{VAR}\{\hat{R}(T^*)\} = \frac{1}{N}\,\mathrm{VAR}\{r(X)\} \tag{3-26}$$

Now the variance of the conditional expectation is less than the total variance, since by [34]

$$\mathrm{VAR}\{E\{r_T(X,\theta) \mid X\}\} = \mathrm{VAR}\{r_T(X,\theta)\} - E\{\mathrm{VAR}\{r_T(X,\theta) \mid X\}\} \le \mathrm{VAR}\{r_T(X,\theta)\} \tag{3-27}$$

Thus

$$\mathrm{VAR}\{\hat{R}(T^*)\} = \frac{1}{N}\,\mathrm{VAR}\{r(X)\} = \frac{1}{N}\,\mathrm{VAR}\{E\{r_T(X,\theta) \mid X\}\} \le \frac{1}{N}\,\mathrm{VAR}\{r_T(X,\theta)\} = \mathrm{VAR}\{\hat{R}(T)\}.$$
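Theorem 3-1 can be checked exactly on a small discrete problem, where the integral in (3-19) becomes a finite sum. The joint probability table below is invented for illustration, classes are numbered from 0, and `per_sample_variance` is a hypothetical helper returning $N \times \mathrm{VAR}\{\hat{R}(T)\}$ together with the mean of $r_T(X,\theta)$.

```python
def per_sample_variance(joint, T):
    """N * VAR{R(T)} from (3-19) for a discrete X, plus E{r_T(X, theta)}.

    joint[i][x] = f_i(x) * pi_i is the joint probability of class i and value x.
    """
    M, n_x = len(joint), len(joint[0])
    second_moment, mean = 0.0, 0.0
    for x in range(n_x):
        col = [joint[i][x] for i in range(M)]
        decided = max(range(M), key=col.__getitem__)    # Bayes decision at x
        for i in range(M):                              # true label theta = i
            r_T = sum(col[m] / sum(col[l] for l in T[m])
                      for m in T[i] if m != decided)    # r_T(x, i) of (3-22)
            mean += col[i] * r_T
            second_moment += col[i] * r_T ** 2
    return second_moment - mean ** 2, mean

# Invented joint table: 3 classes (rows), X takes 3 values (columns).
joint = [[0.30, 0.08, 0.02],
         [0.05, 0.20, 0.05],
         [0.02, 0.08, 0.20]]
M = 3
v_ec, mean_ec = per_sample_variance(joint, [{m} for m in range(M)])
v_post, mean_post = per_sample_variance(joint, [set(range(M))] * M)
v_mixed, mean_mixed = per_sample_variance(joint, [{0, 1}, {0, 1}, {2}])
```

All three choices have the same mean, the Bayes risk of this table, while the posterior choice gives the smallest variance coefficient, as the theorem states.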

If the sets $T_m$, $m=1,2\ldots M$ are chosen to minimize the variance of the estimator $\hat{R}(T)$, the resulting estimator is $\hat{R}(T^*) = \hat{R}(p)$, the posterior estimator. Another consideration in the choice of the sets $T_m$, $m=1,2\ldots M$ is the amount of computation required by the estimator $\hat{R}(T)$.

From equation (3-21), $\hat{R}(p)$ requires at each sample $X_j$, $j=1,2\ldots N$ the computation of the conditional density $f_\ell(X_j)$ for each class $\ell=1,2\ldots M$, a total of $M \times N$ conditional density evaluations. In problems such as speech recognition [23, 1], where the number of classes $M$ is large and evaluation of the conditional densities complex, the amount of computation required by the posterior estimator $\hat{R}(p)$ is considerable. Thus we consider choosing the sets $T_m$, $m=1,2\ldots M$ in such a way that $\hat{R}(T)$ may be computed on the basis of fewer density evaluations per sample.

For now let us disregard the fact that the error function $\epsilon_m(X_j)$ must be determined at each point $X_j$ and for each class $m \in T_{\theta_j}$. Then from (3-18) the densities explicitly required in the estimator $\hat{R}(T)$ at the point $X_j$ are $f_\ell(X_j)$ for all classes $\ell$ in $T_m$, for all $m$ in $T_{\theta_j}$. Define the sets $Q_m$, $m=1,2\ldots M$ associated with given sets $T_m$, $m=1,2\ldots M$ as the union over $q$ in $T_m$ of $T_q$.

Definition 3-1  $Q_m = \bigcup_{q \in T_m} T_q, \quad m=1,2\ldots M$

Table 3-1 gives the sets $Q_m$, $m=1,2\ldots M$ resulting from each choice of the sets $T_m$, $m=1,2\ldots M$. Thus $\hat{R}(T)$ for a given choice of sets $T_m$ requires explicitly the evaluation of $f_\ell(X_j)$ $\forall\, \ell \in Q_{\theta_j}$ for each sample $X_j$, $j=1,2\ldots N$.*

Example 3-1  Suppose $\hat{R}(T)$ is based on one sample $(X_1,\theta_1)$ and that $\theta_1 = 1$. Then if $T_1 = \{1,2\}$, $T_2 = \{1,2\}$ and $T_3 = \{3\}$,

$$\hat{R}(T) = \epsilon_1(X_1)\,\frac{f_1(X_1)\pi_1}{f_1(X_1)\pi_1 + f_2(X_1)\pi_2} + \epsilon_2(X_1)\,\frac{f_2(X_1)\pi_2}{f_1(X_1)\pi_1 + f_2(X_1)\pi_2}$$

Since $Q_i = T_i$ $\forall\, i$ and $3 \notin Q_1$, $f_3(X_1)$ is not used explicitly in $\hat{R}(T)$.
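Definition 3-1 translates directly into a set union. The sketch below computes the sets $Q_m$ for the sets of Example 3-1 and counts the densities explicitly required for a short list of labels; the labels and the helper name `q_sets` are hypothetical.

```python
def q_sets(T):
    """Definition 3-1: Q_m is the union of T_q over all q in T_m."""
    return {m: set().union(*(T[q] for q in T_m)) for m, T_m in T.items()}

T = {1: {1, 2}, 2: {1, 2}, 3: {3}}        # the sets of Example 3-1
Q = q_sets(T)

labels = [1, 3, 2, 1]                     # hypothetical class labels theta_j
explicit_evaluations = sum(len(Q[theta]) for theta in labels)
posterior_evaluations = len(T) * len(labels)   # M x N for the posterior estimator
```

For these four labels the explicit requirement is 7 density evaluations, compared with 12 for the posterior estimator. This count ignores the evaluations needed for the error function itself.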

Of course, the error function $\epsilon_m(X)$ is an implicit function of all the conditional densities $f_\ell(X)$, $\ell = 1,2\ldots M$, and from (3-18) $\epsilon_m(X_j)$ must be computed $\forall\, m \in T_{\theta_j}$. In the next section, the set of classes $T_m$ associated with class $m$ will be chosen in such a way that $\epsilon_m(X_j)$, $m \in T_{\theta_j}$, may be determined on the basis of the densities $f_\ell(X_j)$, $\ell \in Q_{\theta_j}$ explicitly required by the estimator $\hat{R}(T)$. Thus estimators requiring fewer density evaluations are achieved.

* This analysis assumes we must always compute $f_{\theta_j}(X_j)$, even though this is unnecessary when $Q_{\theta_j} = \{\theta_j\}$, since we would be evaluating 1 by $\frac{f_{\theta_j}(X_j)}{f_{\theta_j}(X_j)}$. However, $f_{\theta_j}(X_j)$ will always be needed to compute $\epsilon_m(X_j)$.

3.2.3  A Parameterized Family of Estimators For the Bayes Risk

In section 3.2.2, it was shown that any choice of the sets $T_m$ associated with class $m$ would determine an unbiased, consistent estimator for the Bayes risk of the general form $\hat{R}(T)$, provided the sets $T_m$, $m=1,2\ldots M$, satisfy restrictions (r1) and (r2). In this section, we restrict the choice of the sets as follows. A scalar parameter $\alpha \ge 0$ defines the set $T_m(\alpha)$ of classes "$\alpha$-close" to class $m$. Basically, class $i$ is "$\alpha$-close" to class $m$ if the Bayes rule is likely (as determined by $\alpha$) to classify a sample $X$ whose true class label $\theta = m$, as class $i$. As $\alpha$ varies, the sets $T_m(\alpha)$ vary and a family $\{\hat{R}(\alpha) : 0 \le \alpha < \alpha_{max}\}$ of unbiased estimators is achieved. The definition of "$\alpha$-closeness" allows the estimator $\hat{R}(\alpha)$ to be computed with fewer density evaluations per sample.


Let $\alpha \ge 0$ be a scalar and define $T_m(\alpha)$, the set of classes $\alpha$-close to class $m$ by

Definition 3-2

$$T_m(\alpha) = \{\, i : \exists\, x \ni f_i(x)\pi_i > \alpha \text{ and } f_m(x)\pi_m > \alpha \,\}, \quad \text{for } m=1,2\ldots M.$$

It follows from the definition that $i \in T_m(\alpha)$ if and only if $m \in T_i(\alpha)$, thus restriction (r2) is met automatically for all $\alpha$. Restriction (r1), that $m \in T_m(\alpha)$ $\forall\, m=1,2\ldots M$, is met by restricting $0 \le \alpha < \alpha_{max}$, where we define $\alpha_{max}$ as follows.

Definition 3-3

$$\alpha_{max} = \min_{\ell=1,2\ldots M}\; \max_{x \in S}\; f_\ell(x)\pi_\ell.$$
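Definitions 3-2 and 3-3 can be approximated on a finite grid of points. The sketch below is only an illustration: the unit-variance Gaussian densities, the grid, the 0-based class numbering and the function names are all assumptions, and a grid can miss a narrow overlap that the exact definition would detect.

```python
import math

def joint_density(x, mean, prior, std=1.0):
    """g_m(x) = f_m(x) * pi_m for an assumed Gaussian class conditional density."""
    z = (x - mean) / std
    return prior * math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def alpha_close_sets(alpha, means, priors, grid):
    """Definition 3-2 on a grid: i is in T_m(alpha) when some grid point x
    has both g_i(x) > alpha and g_m(x) > alpha."""
    M = len(means)
    T = [set() for _ in range(M)]
    for x in grid:
        above = [m for m in range(M)
                 if joint_density(x, means[m], priors[m]) > alpha]
        for m in above:
            T[m].update(above)
    return T

def alpha_max(means, priors, grid):
    """Definition 3-3 on the grid: the smallest of the peak joint densities."""
    return min(max(joint_density(x, mu, p) for x in grid)
               for mu, p in zip(means, priors))

# Hypothetical three-class problem with equal priors.
means, priors = [0.0, 1.0, 6.0], [1 / 3, 1 / 3, 1 / 3]
grid = [i * 0.01 for i in range(-500, 1101)]        # x from -5 to 11
T_low = alpha_close_sets(0.0, means, priors, grid)
T_mid = alpha_close_sets(0.05, means, priors, grid)
a_max = alpha_max(means, priors, grid)
```

With these densities every class is 0-close to every other, while at `alpha = 0.05` only the two neighbouring classes remain associated and the distant class stands alone.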


A given $\alpha$, $0 \le \alpha < \alpha_{max}$, does not uniquely determine the sets $T_m(\alpha)$, $m=1,2\ldots M$, since it is possible that $\alpha \ne \alpha'$ but $T_m(\alpha) = T_m(\alpha')$ $\forall\, m=1,2\ldots M$. Let $\alpha_{t_0}, \alpha_{t_1} \ldots \alpha_{t_K}$ be the set of $\alpha$'s that induce changes in the sets $T_m$ for some $m$, defined recursively as follows.

Definition 3-4

    Let $\alpha_{t_0} = 0$.
    Do i=0 by 1 while $\alpha_{t_i} < \alpha_{max}$
        Let $\alpha_{t_{i+1}}$ be the smallest value of $\alpha > \alpha_{t_i}$ such that
        $T_m(\alpha_{t_{i+1}}) \ne T_m(\alpha_{t_i})$ for some m=1,2...M.
    End.
    Let $\alpha_{t_K}$ be the largest value so defined.

TABLE 3-2

    Range of $\alpha$                              T_1       T_2       T_3       Q_1       Q_2       Q_3
    $0 \le \alpha < \alpha_{t_1}$                  {1,2,3}   {1,2,3}   {1,2,3}   {1,2,3}   {1,2,3}   {1,2,3}
    $\alpha_{t_1} \le \alpha < \alpha_{t_2}$       {1,2}     {1,2,3}   {2,3}     {1,2,3}   {1,2,3}   {1,2,3}
    $\alpha_{t_2} \le \alpha < \alpha_{t_3}$       {1,2}     {1,2}     {3}       {1,2}     {1,2}     {3}
    $\alpha_{t_3} \le \alpha < \alpha_{max}$       {1}       {2}       {3}       {1}       {2}       {3}

Sets $T_m(\alpha)$ and $Q_m(\alpha)$ for all $0 \le \alpha < \alpha_{max}$.

Figure 3-1 shows three joint densities and the values $\alpha_{t_0}$, $\alpha_{t_1}$, $\alpha_{t_2}$, $\alpha_{t_3}$ and $\alpha_{max}$. Table 3-2 shows all possible sets $T_m(\alpha)$, $m=1,2,3$ that are defined. The $K+1$ values $\alpha_{t_0}, \alpha_{t_1} \ldots \alpha_{t_K}$ determine the $K+1$ possible sets $T_m(\alpha)$. Note that $K \le \frac{M(M-1)}{2}$, so that the number of possible sets $T_m$ defined by $\alpha$ is much less than the number $2^{M(M-1)/2}$ that were possible in section 3.2.2.
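One way to see the bound on $K$ is that classes $i$ and $m$ stop being $\alpha$-close exactly when $\alpha$ reaches $c_{im} = \max_x \min(f_i(x)\pi_i, f_m(x)\pi_m)$, so each of the $M(M-1)/2$ pairs contributes at most one breakpoint. The sketch below computes the breakpoints this way on a grid; the Gaussian densities, the grid and the function names are assumptions for illustration.

```python
import math
from itertools import combinations

def joint_density(x, mean, prior, std=1.0):
    """g_m(x) = f_m(x) * pi_m for an assumed Gaussian class conditional density."""
    z = (x - mean) / std
    return prior * math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def breakpoints(means, priors, grid):
    """Grid approximation of alpha_t0 < alpha_t1 < ... < alpha_tK and alpha_max.

    The breakpoints are 0 together with the distinct positive overlap heights
    c_im = max_x min(g_i(x), g_m(x)) that lie below alpha_max.
    """
    g = [[joint_density(x, mu, p) for x in grid] for mu, p in zip(means, priors)]
    a_max = min(max(row) for row in g)
    heights = {max(min(a, b) for a, b in zip(g[i], g[m]))
               for i, m in combinations(range(len(means)), 2)}
    return [0.0] + sorted(c for c in heights if 0.0 < c < a_max), a_max

# Hypothetical three-class problem with equal priors.
means, priors = [0.0, 1.0, 6.0], [1 / 3, 1 / 3, 1 / 3]
grid = [i * 0.01 for i in range(-500, 1101)]        # x from -5 to 11
alphas, a_max = breakpoints(means, priors, grid)
K = len(alphas) - 1
```

For these three classes all three pairs have distinct overlap heights below $\alpha_{max}$, so $K$ attains the bound $M(M-1)/2 = 3$.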


The parameter $\alpha$ determines for each $m=1,2\ldots M$, the sets $T_m(\alpha)$ of classes $\alpha$-close to class $m$. An unbiased estimator $\hat{R}_m(\alpha)$ for the conditional risk $R_m$ based on samples $X_j$ whose labels $\theta_j$ are elements of $T_m(\alpha)$ is defined from the general estimator $\hat{R}_m(T)$ in (3-15) as

$$\hat{R}_m(\alpha) = \frac{1}{N} \sum_{j=1}^{N} I_{T_m(\alpha)}(\theta_j)\,\frac{\epsilon_m(X_j) f_m(X_j)}{\sum_{\ell \in T_m(\alpha)} f_\ell(X_j)\pi_\ell} \tag{3-28}$$

The estimator $\hat{R}(\alpha)$ for the unconditional risk $R$ determined by the sets $T_m(\alpha)$, $m=1,2\ldots M$ resulting from $\alpha$ is, as in (3-18)

$$\hat{R}(\alpha) = \frac{1}{N} \sum_{j=1}^{N} \sum_{m \in T_{\theta_j}(\alpha)} \epsilon_m(X_j)\,\frac{f_m(X_j)\pi_m}{\sum_{\ell \in T_m(\alpha)} f_\ell(X_j)\pi_\ell} \tag{3-29}$$

As $\alpha$ varies between 0 and $\alpha_{max}$, a family of unbiased, consistent estimators $\{\hat{R}(\alpha) : 0 \le \alpha < \alpha_{max}\}$ is obtained. Each $\alpha$ does not determine a unique form of an estimator since each $\alpha$ does not uniquely determine the sets $T_m(\alpha)$, $m=1,2\ldots M$. In fact, the family $\{\hat{R}(\alpha) : 0 \le \alpha < \alpha_{max}\}$ contains at most $\frac{M(M-1)}{2} + 1$ different estimators. However, the value of the parameter $\alpha$ is important in determining the density evaluations required by the estimator $\hat{R}(\alpha)$.

When $\alpha = 0$, the estimator $\hat{R}(0)$ is equivalent to the posterior estimator $\hat{R}(p)$ of eq. (3-7), in the sense that estimates of the risk resulting from either estimator are identical and their variances are the same. However, as a member of the family $\{\hat{R}(\alpha) : 0 \le \alpha < \alpha_{max}\}$ it is possible (if the conditional densities have finite support) that $\hat{R}(0)$ may be computed with fewer density evaluations.

If $\exists\, \alpha_e < \alpha_{max}$ such that $T_m(\alpha_e) = \{m\}$, $m=1,2\ldots M$, then $\hat{R}(\alpha_e)$ is equivalent to the error count estimator $\hat{R}(ec)$ of (3-1). Thus the posterior estimator is always equivalent to a member of the family $\{\hat{R}(\alpha) : 0 \le \alpha < \alpha_{max}\}$ while the error count may or may not be. For the class densities in figure 3-2 the error count estimator would not be allowed in the family since $\forall\, \alpha < \alpha_{max}$, $1 \in T_2(\alpha)$ and $2 \in T_1(\alpha)$.

3.2.4  Computational Requirements for Estimators in The Family

The computational requirements for an estimator in the family will be given by the expected number of class conditional density evaluations it requires per sample. As in section 3.2.2, let us first consider the number of density evaluations explicitly required by the estimator $\hat{R}(\alpha)$, disregarding the fact that the error function $\epsilon_m(X_j)$ must be computed for each sample $X_j$ and for each class $m \in T_{\theta_j}(\alpha)$. Then from equation (3-29), at each point $X_j$ the densities $f_\ell(X_j)$ for classes $\ell \in T_m(\alpha)$, for all $m \in T_{\theta_j}(\alpha)$ must be computed. As in definition 3-1, let the sets $Q_m(\alpha)$, $m=1,2\ldots M$ be

$$Q_m(\alpha) = \bigcup_{q \in T_m(\alpha)} T_q(\alpha) \tag{3-30}$$

Then for each sample $X_j$, $j=1,2\ldots N$, $\hat{R}(\alpha)$ requires explicitly the conditional densities $f_\ell(X_j)$ $\forall\, \ell \in Q_{\theta_j}(\alpha)$.

Define the modified error function $\epsilon_m(X, Q_\theta(\alpha))$ for $m \in T_\theta(\alpha)$ as

Definition 3-5

$$\epsilon_m(X, Q_\theta(\alpha)) = \begin{cases} 0, & f_m(X)\pi_m \ge f_\ell(X)\pi_\ell \quad \forall\, \ell \in Q_\theta(\alpha) \\ 1, & \exists\, k \in Q_\theta(\alpha) \ni f_k(X)\pi_k > f_m(X)\pi_m \end{cases}$$

Then the modified error function may be evaluated on the basis of only those conditional densities $f_\ell(X)$, $\ell \in Q_\theta(\alpha)$ explicitly required at $X$ by $\hat{R}(\alpha)$.

The following theorem shows that the modified error function is equal to the error function whenever the joint density of a sample $X$ and its label $\theta$ is greater than $\alpha$.

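Theorem 3-2

For the random vector $(X,\theta)$, if $m \in T_\theta(\alpha)$ and if $f_\theta(X)\pi_\theta > \alpha$ then $\epsilon_m(X, Q_\theta(\alpha)) = \epsilon_m(X)$.

Proof

$\epsilon_m(X, Q_\theta(\alpha)) = 1 \Rightarrow \epsilon_m(X) = 1$, since if $\exists\, k \in Q_\theta(\alpha) \ni f_k(X)\pi_k > f_m(X)\pi_m$ then $\exists\, k \in \{1,2\ldots M\} \ni f_k(X)\pi_k > f_m(X)\pi_m$. We will show that $\epsilon_m(X) = 1 \Rightarrow \epsilon_m(X, Q_\theta(\alpha)) = 1$ by contradiction. For suppose $\epsilon_m(X) = 1$ and $\epsilon_m(X, Q_\theta(\alpha)) = 0$. Then $f_m(X)\pi_m \ge f_\ell(X)\pi_\ell$ $\forall\, \ell \in Q_\theta(\alpha)$ but $\exists\, k \notin Q_\theta(\alpha)$ such that $f_k(X)\pi_k > f_m(X)\pi_m$. But since $\theta \in T_\theta(\alpha)$ and $T_\theta(\alpha) \subset Q_\theta(\alpha)$, $\theta \in Q_\theta(\alpha)$ so $f_k(X)\pi_k > f_m(X)\pi_m \ge f_\theta(X)\pi_\theta > \alpha$. Thus $k \in T_\theta(\alpha)$ and since $T_\theta(\alpha) \subset Q_\theta(\alpha)$, $k \in Q_\theta(\alpha)$, a contradiction.

Figure 3-3. Modified error function equals true error function for sample $X_1$ since $f_{\theta_1}(X_1)\pi_{\theta_1} > \alpha$, but since $f_{\theta_2}(X_2)\pi_{\theta_2} \le \alpha$ the true error function must be used for sample $X_2$. (The figure marks the samples $(X_1, \theta_1 = 1)$ and $(X_2, \theta_2 = 1)$ and lists $T_1(\alpha) = \{1,2\}$, $T_2(\alpha) = \{1,2\}$, $T_3(\alpha) = \{3\}$, $Q_1(\alpha) = \{1,2\}$, $Q_2(\alpha) = \{1,2\}$, $Q_3(\alpha) = \{3\}$.)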

Example 3-2  Consider the three class densities in figure 3-3 and the sets $T_m(\alpha)$ and $Q_m(\alpha)$ associated with the given $\alpha$. Since $\theta_1 = \theta_2 = 1$ and $T_1(\alpha) = \{1,2\}$, the functions $\epsilon_1(X_1)$, $\epsilon_2(X_1)$, $\epsilon_1(X_2)$ and $\epsilon_2(X_2)$ must be determined. Since $f_{\theta_1}(X_1)\pi_{\theta_1} > \alpha$, $\epsilon_1(X_1, Q_1(\alpha)) = \epsilon_1(X_1) = 1$ and $\epsilon_2(X_1, Q_1(\alpha)) = \epsilon_2(X_1) = 0$. However, for the sample $X_2$, $\epsilon_2(X_2, Q_1(\alpha)) = 0$ but $\epsilon_2(X_2) = 1$. Since $f_1(X_2)\pi_1 \le \alpha$, the conditional density $f_3(X_2)$ must be computed.

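Define the indicator function for the event $f_\theta(X)\pi_\theta > \alpha$ as

Definition 3-6

$$I_\theta(X,\alpha) = \begin{cases} 1, & f_\theta(X)\pi_\theta > \alpha \\ 0, & f_\theta(X)\pi_\theta \le \alpha \end{cases}$$

Then as a corollary to theorem 3-2, we have

Corollary 3-2  If $m \in T_\theta(\alpha)$ then

$$I_\theta(X,\alpha)\,\epsilon_m(X, Q_\theta(\alpha)) = I_\theta(X,\alpha)\,\epsilon_m(X).$$

By corollary 3-2, a computational form for the estimator $\hat{R}(\alpha)$ is given by

$$c\hat{R}(\alpha) = \frac{1}{N} \sum_{j=1}^{N} \sum_{m \in T_{\theta_j}(\alpha)} \Big\{ I_{\theta_j}(X_j,\alpha)\,\epsilon_m(X_j, Q_{\theta_j}(\alpha)) + \big(1 - I_{\theta_j}(X_j,\alpha)\big)\,\epsilon_m(X_j) \Big\} \frac{f_m(X_j)\pi_m}{\sum_{\ell \in T_m(\alpha)} f_\ell(X_j)\pi_\ell} \tag{3-31}$$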

Note that while the estimator $\hat{R}(\alpha)$ itself depends on $\alpha$ only through the sets $T_m(\alpha)$, $m=1,2\ldots M$, the computational form $c\hat{R}(\alpha)$ uses $\alpha$ directly to determine which densities need to be evaluated. For if $(X_j,\theta_j)$ is such that $f_{\theta_j}(X_j)\pi_{\theta_j} > \alpha$ then only those densities $f_\ell(X_j)$ for $\ell \in Q_{\theta_j}(\alpha)$ must be computed, since by theorem 3-2, the modified error function $\epsilon_m(X_j, Q_{\theta_j}(\alpha))$ is equal to the true error function $\epsilon_m(X_j)$, for $m \in T_{\theta_j}(\alpha)$. If $f_{\theta_j}(X_j)\pi_{\theta_j} \le \alpha$ then all densities $f_\ell(X_j)$, $\ell=1,2\ldots M$ must be computed. The total number of density evaluations required by $c\hat{R}(\alpha)$ is thus

$$\sum_{j=1}^{N} \Big[ I_{\theta_j}(X_j,\alpha)\,|Q_{\theta_j}(\alpha)| + \big(1 - I_{\theta_j}(X_j,\alpha)\big) M \Big], \tag{3-32}$$

where $|Q_{\theta_j}(\alpha)|$ is the number of classes in the set $Q_{\theta_j}(\alpha)$. The expression above depends on the actual values of the sample $\{(X_1,\theta_1), (X_2,\theta_2), \ldots (X_N,\theta_N)\}$. What we are really interested in is the expected number of density evaluations required in $c\hat{R}(\alpha)$, which is just the expectation of (3-32).
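The counting rule (3-32) and the modified error function of definition 3-5 can be sketched as follows. The joint density table, the sample list and the function names are invented for illustration; the sets $Q_m$ are those that definition 3-2 gives for this table at $\alpha = 0.05$.

```python
def modified_error(m, joint_on_Q):
    """Definition 3-5: 0 if class m attains the largest joint density among
    the classes in Q_theta(alpha), else 1. joint_on_Q maps class -> f * pi."""
    return 0 if joint_on_Q[m] >= max(joint_on_Q.values()) else 1

def density_evaluations(samples, labels, joint, Q, alpha, M):
    """Expression (3-32): |Q_theta| evaluations when f_theta(X) * pi_theta > alpha,
    otherwise all M. joint(x, m) returns f_m(x) * pi_m."""
    total = 0
    for x, theta in zip(samples, labels):
        if joint(x, theta) > alpha:
            total += len(Q[theta])
        else:
            total += M
    return total

# Invented joint densities f_m(x) * pi_m at two points "a" and "b".
table = {("a", 1): 0.30, ("a", 2): 0.10, ("a", 3): 0.01,
         ("b", 1): 0.02, ("b", 2): 0.03, ("b", 3): 0.20}

def joint(x, m):
    return table[(x, m)]

Q = {1: {1, 2}, 2: {1, 2}, 3: {3}}
n_evals = density_evaluations(["a", "b", "b"], [1, 1, 3], joint, Q, alpha=0.05, M=3)

# At "b" with label 1 the label's joint density is below alpha, and the
# modified error function for class 2 misses the larger class-3 density.
eps_2_modified = modified_error(2, {q: joint("b", q) for q in Q[1]})
```

Here `n_evals` is 6 rather than the 9 evaluations of the posterior estimator. The value `eps_2_modified` is 0 although class 3 has the largest joint density at "b", which is the situation of Example 3-2 and the reason all $M$ densities are evaluated when the indicator is 0.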

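Let $C(\alpha)$ be the expected number of density evaluations per sample required by $c\hat{R}(\alpha)$. Then

$$\begin{aligned}
C(\alpha) &= E\{ I_\theta(X,\alpha)\,|Q_\theta(\alpha)| + (1 - I_\theta(X,\alpha)) M \} \\
&= \sum_{\ell=1}^{M} \int_S \big[ I_\ell(x,\alpha)\,|Q_\ell(\alpha)| + (1 - I_\ell(x,\alpha)) M \big] f_\ell(x)\pi_\ell\,dx \\
&= \sum_{\ell=1}^{M} \Big[ |Q_\ell(\alpha)| \int_S I_\ell(x,\alpha) f_\ell(x)\pi_\ell\,dx + M \int_S (1 - I_\ell(x,\alpha)) f_\ell(x)\pi_\ell\,dx \Big].
\end{aligned} \tag{3-33}$$

Let

$$U_\ell(\alpha) = \int_S (1 - I_\ell(x,\alpha)) f_\ell(x)\pi_\ell\,dx = \int_{\{x \in S \,:\, f_\ell(x)\pi_\ell \le \alpha\}} f_\ell(x)\pi_\ell\,dx \tag{3-34}$$

Then (3-33) may be written as

$$C(\alpha) = \sum_{\ell=1}^{M} \big[ |Q_\ell(\alpha)|\,(\pi_\ell - U_\ell(\alpha)) + M\,U_\ell(\alpha) \big] \tag{3-35}$$

The total expected number of density evaluations required by $c\hat{R}(\alpha)$ is just $N \times C(\alpha)$, the expected number per sample times the number of samples.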

Consider now the behavior of $C(\alpha)$ as a function of $\alpha$. It is clear that for each $m=1,2\ldots M$, $|T_m(\alpha)|$, the number of classes in the set $T_m(\alpha)$, is non-increasing in $\alpha$. The points $\alpha_{t_0}, \alpha_{t_1} \ldots \alpha_{t_K}$ given by definition 3-4 are the values of $\alpha$ which cause a decrease in $|T_m(\alpha)|$ for some $m=1,2\ldots M$. Figure 3-4 shows the behavior of $|T_m(\alpha)|$, $m=1,2,3$ for the class densities given in figure 3-1.

Recall that the sets $Q_m(\alpha)$, $m=1,2\ldots M$, were defined in terms of the sets $T_m(\alpha)$, $m=1,2\ldots M$ as $Q_m(\alpha) = \bigcup_{q \in T_m(\alpha)} T_q(\alpha)$, $m=1,2\ldots M$. Thus $|Q_m(\alpha)|$, the number of classes in the set $Q_m(\alpha)$, is also non-increasing in $\alpha$ for each $m=1,2\ldots M$. Analogous to the points $\alpha_{t_0}, \alpha_{t_1} \ldots \alpha_{t_K}$, we define the points $\alpha_{q_0}, \alpha_{q_1} \ldots \alpha_{q_J}$ that induce changes in the sets $Q_m(\alpha)$ recursively as follows.

Definition 3-7

    Let $\alpha_{q_0} = 0$.
    Do i=0 by 1 while $\alpha_{q_i} < \alpha_{max}$
        Let $\alpha_{q_{i+1}}$ be the smallest value of $\alpha > \alpha_{q_i}$ such that
        $Q_m(\alpha_{q_{i+1}}) \ne Q_m(\alpha_{q_i})$ for some m=1,2...M.
    End.
    Let $\alpha_{q_J}$ be the largest value so defined.

Then the values $\alpha_{q_0}, \alpha_{q_1} \ldots \alpha_{q_J}$ are those values of $\alpha$ which cause a decrease in $|Q_m(\alpha)|$ for some $m$. Also, it is clear that for each $j=0,1\ldots J$, $\alpha_{q_j} = \alpha_{t_i}$ for some $i=0,1\ldots K$. In figure 3-1, the points $\alpha_{q_0}$, $\alpha_{q_1}$ and $\alpha_{q_2}$ are given, and Table 3-2 shows the sets $Q_m(\alpha)$, $m=1,2,3$. Figure 3-5 shows the behavior of $|Q_m(\alpha)|$, $m=1,2,3$ for the class densities in figure 3-1.

From (3-34), it is clear that $U_\ell(\alpha)$, $\ell=1,2\ldots M$ is a non-decreasing function of $\alpha$. Rewriting the expression (3-35) for $C(\alpha)$ as

$$C(\alpha) = \sum_{\ell=1}^{M} \big[ U_\ell(\alpha)\,(M - |Q_\ell(\alpha)|) + \pi_\ell\,|Q_\ell(\alpha)| \big] \tag{3-36}$$

we see that for $\alpha_{q_i} \le \alpha < \alpha_{q_{i+1}}$, $C(\alpha)$ is non-decreasing, since $\forall\, \ell=1,2\ldots M$, $U_\ell(\alpha)$ is non-decreasing and $|Q_\ell(\alpha)|$ is constant. Figure 3-6 shows a schematic drawing of $C(\alpha)$ as a function of $\alpha$ for the densities in figure 3-1.
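The expected cost $C(\alpha)$ of (3-35) can be approximated numerically. The sketch below assumes unit-variance Gaussian class conditional densities, builds the sets $T_m(\alpha)$ and $Q_m(\alpha)$ on a grid, and replaces the integral in (3-34) by a Riemann sum; the densities, the grid and the function names are illustrative assumptions.

```python
import math

def joint_density(x, mean, prior, std=1.0):
    """g_m(x) = f_m(x) * pi_m for an assumed Gaussian class conditional density."""
    z = (x - mean) / std
    return prior * math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def expected_evaluations(alpha, means, priors, grid, dx):
    """C(alpha) from (3-35), with T, Q and U_l approximated on a grid."""
    M = len(means)
    g = [[joint_density(x, mu, p) for x in grid] for mu, p in zip(means, priors)]
    # T_m(alpha): classes sharing a grid point where both joint densities exceed alpha
    T = [set() for _ in range(M)]
    for k in range(len(grid)):
        above = [m for m in range(M) if g[m][k] > alpha]
        for m in above:
            T[m].update(above)
    # Q_m(alpha): union of T_q(alpha) over q in T_m(alpha), as in (3-30)
    Q = [set().union(*(T[q] for q in T[m])) for m in range(M)]
    # U_l(alpha): mass of g_l over the region where g_l <= alpha, as in (3-34)
    U = [sum(v for v in g[l] if v <= alpha) * dx for l in range(M)]
    return sum(len(Q[l]) * (priors[l] - U[l]) + M * U[l] for l in range(M))

# Hypothetical three-class problem with equal priors.
means, priors = [0.0, 1.0, 6.0], [1 / 3, 1 / 3, 1 / 3]
dx = 0.01
grid = [i * dx for i in range(-500, 1101)]          # x from -5 to 11
c_zero = expected_evaluations(0.0, means, priors, grid, dx)
c_mid = expected_evaluations(0.05, means, priors, grid, dx)
```

At `alpha = 0` every density is needed at every sample, so `c_zero` equals $M = 3$; at `alpha = 0.05` the smaller sets $Q_m$ outweigh the extra full evaluations and the expected cost drops below 3.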

3.2.5  Variances  of  Estimators  in  The  Family 

Next consider the variances of estimators in the family $\{\hat{R}(\alpha) : 0 \le \alpha < \alpha_{max}\}$. A given $\alpha$ defines the sets $T_m(\alpha)$, $m=1,2\ldots M$ and hence an estimator $\hat{R}(\alpha)$ of the general form $\hat{R}(T)$. From the variance of $\hat{R}(T)$ given by (3-19), we may write

$$\begin{aligned}
\mathrm{VAR}\{\hat{R}(\alpha)\} &= \frac{1}{N}\,\mathrm{VAR}\left\{ \sum_{m \in T_\theta(\alpha)} \epsilon_m(X)\,\frac{f_m(X)\pi_m}{\sum_{\ell \in T_m(\alpha)} f_\ell(X)\pi_\ell} \right\} \\
&= \frac{1}{N} \left[ \sum_{i=1}^{M} \pi_i \int_S \left( \sum_{m \in T_i(\alpha)} \frac{\epsilon_m(x) f_m(x)\pi_m}{\sum_{\ell \in T_m(\alpha)} f_\ell(x)\pi_\ell} \right)^{2} f_i(x)\,dx - R^2 \right]
\end{aligned} \tag{3-37}$$

Let the coefficient of variance $V(\alpha)$ be defined by

$$V(\alpha) = N \times \mathrm{VAR}\{\hat{R}(\alpha)\} \tag{3-38}$$

Figure 3-7. The coefficient of variance $V(\alpha)$ as a function of $\alpha$.

When $\alpha = 0$, $\hat{R}(0)$ is equivalent to the posterior estimator $\hat{R}(p)$. Theorem 3-1 shows that the variance of the posterior estimator is smaller than the variance of any estimator expressed in the general form, thus as a corollary we have

Corollary 3-1  $V(0) \le V(\alpha) \quad \forall\, 0 \le \alpha < \alpha_{max}$

If there exists an $\alpha_e < \alpha_{max}$ such that $T_m(\alpha_e) = \{m\}$ $\forall\, m=1,2\ldots M$, then $\hat{R}(\alpha_e)$ is equivalent to the error count estimator. Thus by (3-3), $V(\alpha_e) = R(1-R)$. By corollary 3-1, $V(0) \le V(\alpha_e)$.

Now $V(\alpha)$ depends on $\alpha$ only through the sets $T_m(\alpha)$, $m=1,2\ldots M$ (see (3-37)). Since the points $\alpha_{t_0}, \alpha_{t_1} \ldots \alpha_{t_K}$ induce changes in these sets, $V(\alpha)$ is a step function of $\alpha$ with discontinuities at these points. Figure 3-7 gives a schematic drawing of $V(\alpha)$ for the densities in figure 3-1.

Examples indicate that $V(\alpha)$ is a non-decreasing function of $\alpha$. Consider the estimator $\hat{R}_m(\alpha)$ for the conditional risk $R_m$ given in (3-28). The variance of $\hat{R}_m(\alpha)$ is

$$\mathrm{VAR}\{\hat{R}_m(\alpha)\} = \frac{1}{N} \left[ \int_S \epsilon_m(x)\,\frac{f_m^2(x)}{\sum_{q \in T_m(\alpha)} f_q(x)\pi_q}\,dx - R_m^2 \right]. \tag{3-39}$$

Define the coefficient of variance $V_m(\alpha)$ for $\hat{R}_m(\alpha)$ as

$$V_m(\alpha) = N \times \mathrm{VAR}\{\hat{R}_m(\alpha)\}. \tag{3-40}$$

Then $V_m(\alpha)$ is a non-decreasing function of $\alpha$ since, as $\alpha$ increases, the number of classes in $T_m(\alpha)$ is non-increasing and hence $\sum_{q \in T_m(\alpha)} f_q(x)\pi_q$ is non-increasing.

The covariance of the estimators $\hat{R}_m(\alpha)$ and $\hat{R}_\ell(\alpha)$ is given by

$$\mathrm{COV}\{\hat{R}_m(\alpha), \hat{R}_\ell(\alpha)\} = \frac{1}{N} \left[ \sum_{i \in T_m(\alpha) \cap T_\ell(\alpha)} \pi_i \int_S \epsilon_m(x)\,\epsilon_\ell(x)\,\frac{f_m(x) f_\ell(x) f_i(x)}{\sum_{q \in T_m(\alpha)} f_q(x)\pi_q\; \sum_{r \in T_\ell(\alpha)} f_r(x)\pi_r}\,dx - R_m R_\ell \right] \tag{3-41}$$

where $\sum_{i \in T_m(\alpha) \cap T_\ell(\alpha)} (\,\cdot\,) = 0$ when the intersection of $T_m(\alpha)$ and $T_\ell(\alpha)$ is empty, i.e. $T_m(\alpha) \cap T_\ell(\alpha) = \emptyset$.

Let

$$C_{m\ell}(\alpha) = N \times \mathrm{COV}\{\hat{R}_m(\alpha), \hat{R}_\ell(\alpha)\} \tag{3-42}$$

be the coefficient of covariance. Since $\hat{R}(\alpha) = \sum_{m=1}^{M} \pi_m \hat{R}_m(\alpha)$, the coefficient of variance $V(\alpha)$ for $\hat{R}(\alpha)$ may be expressed in terms of $V_m(\alpha)$ and $C_{m\ell}(\alpha)$ as

$$V(\alpha) = \sum_{m=1}^{M} \pi_m^2 V_m(\alpha) + \sum_{m=1}^{M} \sum_{\ell \ne m} \pi_m \pi_\ell\, C_{m\ell}(\alpha) \tag{3-43}$$

Now $\forall\, m=1,2\ldots M$, $V_m(\alpha)$ is non-decreasing in $\alpha$ and $V_m(0) \le V_m(\alpha)$, $\forall\, 0 \le \alpha < \alpha_{max}$. However, from (3-41), $C_{m\ell}(\alpha)$ achieves its minimum value of $-R_m R_\ell$ when $T_m(\alpha) \cap T_\ell(\alpha) = \emptyset$, which would occur for large values of $\alpha$. Thus if the increases in the conditional variances $V_m(\alpha)$ dominate the possible decreases in $C_{m\ell}(\alpha)$, as $\alpha$ increases one would expect the coefficient of variance $V(\alpha)$ to increase.
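The monotonicity of $V_m(\alpha)$ stated with (3-39) and (3-40) can be checked exactly on a small discrete problem, where the integral becomes a sum and shrinking $T_m$ only shrinks the denominator. The joint probability table, the 0-based class numbering and the helper name are invented for illustration.

```python
def conditional_variance_coefficient(joint, priors, m, T_m):
    """V_m = N * VAR{R_m(T)} from (3-39) and (3-40) for a discrete X.

    joint[i][x] = f_i(x) * pi_i; T_m is the set of classes associated with m.
    """
    M, n_x = len(joint), len(joint[0])
    second_moment, R_m = 0.0, 0.0
    for x in range(n_x):
        col = [joint[i][x] for i in range(M)]
        decided = max(range(M), key=col.__getitem__)    # Bayes decision at x
        if decided == m:
            continue                                    # eps_m(x) = 0 here
        f_m = col[m] / priors[m]                        # class conditional density
        R_m += f_m
        second_moment += f_m ** 2 / sum(col[q] for q in T_m)
    return second_moment - R_m ** 2

# Invented joint table: 3 classes (rows), X takes 3 values (columns).
joint = [[0.30, 0.08, 0.02],
         [0.05, 0.20, 0.05],
         [0.02, 0.08, 0.20]]
priors = [0.40, 0.30, 0.30]                             # row sums of the table

v_all = conditional_variance_coefficient(joint, priors, 0, {0, 1, 2})
v_pair = conditional_variance_coefficient(joint, priors, 0, {0, 1})
v_self = conditional_variance_coefficient(joint, priors, 0, {0})
```

The three nested sets give `v_all <= v_pair <= v_self`; the last value is the error count case, for which (3-39) reduces to $R_m/\pi_m - R_m^2$.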

3.2.6  Examples 

Some examples are given in the appendix. In example 1, there are five equally likely classes. The class conditional densities are Gaussian with standard deviations equal to one and means 0, .75, 7, 8, 9.5. Page A-1 gives a sketch of the densities as well as the true conditional and unconditional risks. Page A-2 gives the values of $\alpha_{t_0}, \alpha_{t_1} \ldots \alpha_{t_{10}}$, $\alpha_{q_0}, \alpha_{q_1} \ldots \alpha_{q_J}$ and the sets $T_m(\alpha_{t_i})$, $Q_m(\alpha_{t_i})$, $m=1,2\ldots 5$ which they determine. Page A-3 gives, for the points $\alpha_{t_i}$, the expected number of density evaluations $C(\alpha)$ and the coefficient of variance $V(\alpha)$ for those densities. Note that $C(\alpha)$ decreases to a minimum of 2.6 at $\alpha_{t_6}$. The sets $T_m(\alpha_{t_6})$, $m=1,2\ldots 5$ on page A-2 are seen to be the natural grouping of the classes. The increase in $C(\alpha)$ for $\alpha > \alpha_{t_6}$ is due to the fact that while the sets $T_m(\alpha)$ and $Q_m(\alpha)$ are getting smaller, for larger $\alpha$ it is becoming less likely that a sample $X$ has the property that $f_\theta(X)\pi_\theta > \alpha$. When $f_\theta(X)\pi_\theta \le \alpha$, all densities $f_\ell(X)$, $\ell=1,2\ldots 5$ must be computed to determine the error function, so the smallness of the sets $Q_m(\alpha)$ becomes irrelevant.

The variances $V(\alpha_{t_i})$ are clearly non-decreasing in $i$. For $\alpha \le \alpha_{t_6}$, they are approximately the same. For $\alpha > \alpha_{t_6}$, the variances about double at each decrease in some set $T_m$. Page A-4 shows the covariance matrix for the conditional risk estimators $\hat{R}_m(\alpha)$, $\hat{R}_\ell(\alpha)$, $m,\ell=1,2\ldots M$ for $\alpha = \alpha_{t_0}$, $\alpha_{t_8}$ and $\alpha_{t_{10}}$. Note that the increase in variance as $\alpha$ increases seems to be due mostly to the variance increase in $\hat{R}_m(\alpha)$, rather than to changes in the covariance of $\hat{R}_m(\alpha)$ and $\hat{R}_\ell(\alpha)$.

Page A-5 shows the behavior of the distinct estimators $\hat{R}(\alpha_{t_i})$ for various sample sizes where larger variances for larger $\alpha$'s are reflected. For $\alpha = \alpha_{t_0}$, the estimator $\hat{R}(\alpha_{t_0})$ is equivalent to the posterior estimator $\hat{R}(p)$. For $\alpha = \alpha_{t_{10}}$, $\hat{R}(\alpha_{t_{10}})$ is equivalent to the error count estimator.

In example 2, the five classes have unequal priors given by .1, .3, .2, .19, .21. The class conditional densities are normal with means 0, .5, 6, 10, 11 and equal standard deviations of 1. The densities are sketched on page B-1 and the true conditional and unconditional risks are given. Page B-2 gives the values $\alpha_{t_0}, \alpha_{t_1} \ldots \alpha_{t_8}$, $\alpha_{q_0} \ldots \alpha_{q_3}$ and the resulting sets $T_m$ and $Q_m$. Note that the error count estimator is not included in the family $\{\hat{R}(\alpha) : 0 \le \alpha < \alpha_{max}\}$.

Page B-3 gives $C(\alpha)$ and $V(\alpha)$ for the points $\alpha_{t_i}$, $i=0,\ldots 8$. The expected number of density evaluations per sample $C(\alpha)$ decreases to a minimum of 1.94 at $\alpha_{t_8}$. Again the variance $V(\alpha)$ appears to be increasing in $\alpha$.

Page B-4 shows the behavior of $\hat{R}(\alpha)$ for $\alpha = \alpha_{t_i}$, $i=0\ldots 8$. $\hat{R}(\alpha_{t_0})$ is equivalent to the posterior estimator. The error count estimator is included for reference.

In chapter 4, the problem of choosing an optimal estimator from the family $\{\hat{R}(\alpha) : 0 \le \alpha < \alpha_{max}\}$ is discussed. Considerations of optimality will involve the variance $V(\alpha)$ of the estimators and $C(\alpha)$, the expected number of density evaluations per sample.

But first, the technique of stratified sampling will be discussed in terms of a family of risk estimators.

3.3 Estimators Based on Stratified Sampling

Stratified sampling is a classic Monte Carlo technique for reducing the variance of an estimator for an integral [31, 19, 39, 16]. Basically, one partitions the region of integration and samples independently from each partition. In this case, the integral to be estimated is

$$R = \sum_{m=1}^{M} \pi_m \int_S \epsilon_m(x) f_m(x)\,dx.$$

The summation represents a natural partition of the integral. Thus rather than sampling $(X, \theta)$ where $\theta$ is random, the class $\theta = m$ is fixed a priori and observations are sampled independently from the distribution of $X$ given $m$, for $m = 1, 2, \ldots, M$.

Let the stratified sample of size $N$ be denoted by $\{(X_{11}, X_{12}, \ldots, X_{1n_1}), (X_{21}, X_{22}, \ldots, X_{2n_2}), \ldots, (X_{M1}, X_{M2}, \ldots, X_{Mn_M})\}$, where $N = \sum_{i=1}^{M} n_i$. The samples $X_{ij}$, $i = 1, 2, \ldots, M$, $j = 1, 2, \ldots, n_i$ are independent, and the distribution of $X_{ij}$, $j = 1, 2, \ldots, n_i$ is identical to that of $X_i$. The density of $X_i$ is given by $f_i(x)$, the conditional density of $X$ given class $i$. The statistician is free to choose the number $n_i$ of samples from class $i$, $i = 1, 2, \ldots, M$ in any way such that $\sum_{i=1}^{M} n_i = N$. Optimal and heuristic choices of these sample sizes will be discussed in section 3.3.2.

3.3.1  A Parameterized  Family  of  Bayes  Risk  Estimators 

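A family of estimators for the Bayes risk $R$ based on stratified sampling is defined as $\{\widehat{SR}(\alpha) : 0 \le \alpha < \alpha_{\max}\}$. The scalar parameter $\alpha$ determines the set of classes $T_m(\alpha)$ that are "$\alpha$-close" to class $m$, $m = 1, 2, \ldots, M$ as in section 3.2.3. The stratified estimator $\widehat{SR}_m(\alpha)$ for the conditional risk $R_m$ is based on samples $X_{ij}$ for $i \in T_m(\alpha)$, $j = 1, 2, \ldots, n_i$ and is given by

$$\widehat{SR}_m(\alpha) = \sum_{i \in T_m(\alpha)} \frac{\pi_i}{n_i} \sum_{j=1}^{n_i} \epsilon_m(X_{ij}) \frac{f_m(X_{ij})}{\sum_{\ell \in T_m(\alpha)} f_\ell(X_{ij})\pi_\ell}.$$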

$\widehat{SR}_m(\alpha)$ is an unbiased estimator for $R_m$ since

$$E\{\widehat{SR}_m(\alpha)\} = \sum_{i \in T_m(\alpha)} \pi_i\, E\left\{ \epsilon_m(X_i) \frac{f_m(X_i)}{\sum_{\ell \in T_m(\alpha)} f_\ell(X_i)\pi_\ell} \right\} \tag{3-44}$$

$$= \sum_{i \in T_m(\alpha)} \pi_i \int_S \epsilon_m(x) \frac{f_m(x) f_i(x)\,dx}{\sum_{\ell \in T_m(\alpha)} f_\ell(x)\pi_\ell} = \int_S \epsilon_m(x) f_m(x)\,dx = R_m. \tag{3-45}$$

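The estimator $\widehat{SR}(\alpha)$ for the unconditional risk $R$ is defined

$$\widehat{SR}(\alpha) = \sum_{m=1}^{M} \pi_m \widehat{SR}_m(\alpha) = \sum_{m=1}^{M} \pi_m \sum_{i \in T_m(\alpha)} \frac{\pi_i}{n_i} \sum_{j=1}^{n_i} \epsilon_m(X_{ij}) \frac{f_m(X_{ij})}{\sum_{\ell \in T_m(\alpha)} f_\ell(X_{ij})\pi_\ell} \tag{3-46}$$

and is unbiased by linearity of the expectation operator. Also, since $i \in T_m(\alpha)$ iff $m \in T_i(\alpha)$, $\sum_{m=1}^{M} \sum_{i \in T_m(\alpha)} = \sum_{i=1}^{M} \sum_{m \in T_i(\alpha)}$, and thus $\widehat{SR}(\alpha)$ may be written

$$\widehat{SR}(\alpha) = \sum_{i=1}^{M} \frac{\pi_i}{n_i} \sum_{j=1}^{n_i} \sum_{m \in T_i(\alpha)} \epsilon_m(X_{ij}) \frac{f_m(X_{ij})\pi_m}{\sum_{\ell \in T_m(\alpha)} f_\ell(X_{ij})\pi_\ell}. \tag{3-47}$$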

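When $\alpha = 0$, $\widehat{SR}(0)$ is equivalent to the stratified posterior estimator $\widehat{SR}(p)$, where $\widehat{SR}(p)$ is given as in [39] by

$$\widehat{SR}(p) = \sum_{i=1}^{M} \frac{\pi_i}{n_i} \sum_{j=1}^{n_i} r(X_{ij}) = \sum_{i=1}^{M} \frac{\pi_i}{n_i} \sum_{j=1}^{n_i} \sum_{m=1}^{M} \epsilon_m(X_{ij}) \frac{f_m(X_{ij})\pi_m}{\sum_{\ell=1}^{M} f_\ell(X_{ij})\pi_\ell}. \tag{3-48}$$

Also, if $\exists\, \alpha_e < \alpha_{\max}$ such that $T_m(\alpha_e) = \{m\}$ $\forall m = 1, 2, \ldots, M$ then $\widehat{SR}(\alpha_e)$ is equivalent to the stratified error count estimator $\widehat{SR}(ec)$, where $\widehat{SR}(ec)$ is [39],

$$\widehat{SR}(ec) = \sum_{i=1}^{M} \frac{\pi_i}{n_i} \sum_{j=1}^{n_i} \epsilon_i(X_{ij}). \tag{3-49}$$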
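The form (3-47) can be evaluated directly once the sets $T_m(\alpha)$ are known. The following is a minimal sketch, not the computational form of section 3.3.3: it assumes univariate normal class densities with the priors and means of example 2, takes $\epsilon_m(x)$ to be 1 when the Bayes decision at $x$ differs from class $m$, and uses hypothetical helper names (`normal_pdf`, `error_fn`, `stratified_risk`). The choice `T_self` is used only to illustrate the reduction to (3-49); as noted earlier, the error count estimator is not actually in the family for example 2.

```python
import math
import random


def normal_pdf(x, mean, sd=1.0):
    """Density of a univariate normal distribution (illustrative assumption)."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def error_fn(m, x, priors, means):
    """epsilon_m(x): 1 if the Bayes decision at x differs from class m, else 0."""
    joint = [p * normal_pdf(x, mu) for p, mu in zip(priors, means)]
    decision = max(range(len(joint)), key=lambda k: joint[k])
    return 0 if decision == m else 1


def stratified_risk(samples, priors, means, T):
    """Evaluate (3-47). samples[i] holds the n_i draws from class i and
    T[m] is the set of classes alpha-close to class m (0-based labels)."""
    total = 0.0
    for i, xs in enumerate(samples):
        inner = 0.0
        for x in xs:
            for m in T[i]:
                denom = sum(priors[l] * normal_pdf(x, means[l]) for l in T[m])
                numer = priors[m] * normal_pdf(x, means[m])
                inner += error_fn(m, x, priors, means) * numer / denom
        total += priors[i] / len(xs) * inner
    return total


# Priors and means of example 2, with n_i = N * pi_i and N = 1000.
random.seed(0)
priors = [0.1, 0.3, 0.2, 0.19, 0.21]
means = [0.0, 0.5, 6.0, 10.0, 11.0]
samples = [[random.gauss(mu, 1.0) for _ in range(round(1000 * p))]
           for p, mu in zip(priors, means)]
M = len(priors)
T_all = [set(range(M)) for _ in range(M)]   # every class close: posterior form (3-48)
T_self = [{m} for m in range(M)]            # T_m = {m}: error count form (3-49)
print(stratified_risk(samples, priors, means, T_all))
print(stratified_risk(samples, priors, means, T_self))
```

With `T_all` the inner sum collapses to the posterior risk $r(X_{ij})$, and with `T_self` each term collapses to $\epsilon_i(X_{ij})$, which is the correspondence stated in (3-48) and (3-49).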
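3.3.2  Variances of Estimators in The Family

The variance of the estimator $\widehat{SR}(\alpha)$ is

$$\mathrm{VAR}\{\widehat{SR}(\alpha)\} = \sum_{i=1}^{M} \frac{\pi_i^2}{n_i}\, \mathrm{VAR}\left\{ \sum_{m \in T_i(\alpha)} \epsilon_m(X_i) \frac{f_m(X_i)\pi_m}{\sum_{\ell \in T_m(\alpha)} f_\ell(X_i)\pi_\ell} \right\}$$

$$= \sum_{i=1}^{M} \frac{\pi_i^2}{n_i} \left\{ \int_S \left( \sum_{m \in T_i(\alpha)} \epsilon_m(x) \frac{f_m(x)\pi_m}{\sum_{\ell \in T_m(\alpha)} f_\ell(x)\pi_\ell} \right)^2 f_i(x)\,dx - \left[ \int_S \sum_{m \in T_i(\alpha)} \epsilon_m(x) \frac{f_m(x)\pi_m f_i(x)\,dx}{\sum_{\ell \in T_m(\alpha)} f_\ell(x)\pi_\ell} \right]^2 \right\}. \tag{3-50}$$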


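A heuristic choice of the number of samples $n_i$ from class $i$ is to let $n_i$ be proportional to the prior probability $\pi_i$ of class $i$, thus $n_i = N\pi_i$. Even though this choice is not optimal, the estimator $\widehat{SR}(\alpha)$ based on the stratified sample has variance no larger than that of the estimator $\hat{R}(\alpha)$ based on unrestricted sampling. To see this, with $n_i = N\pi_i$, the variance of $\widehat{SR}(\alpha)$ is given by

$$\mathrm{VAR}\{\widehat{SR}(\alpha)\} = \frac{1}{N} \left\{ \sum_{i=1}^{M} \pi_i \int_S \left( \sum_{m \in T_i(\alpha)} \epsilon_m(x) \frac{f_m(x)\pi_m}{\sum_{\ell \in T_m(\alpha)} f_\ell(x)\pi_\ell} \right)^2 f_i(x)\,dx - \sum_{i=1}^{M} \pi_i \left[ \int_S \sum_{m \in T_i(\alpha)} \epsilon_m(x) \frac{f_m(x)\pi_m f_i(x)\,dx}{\sum_{\ell \in T_m(\alpha)} f_\ell(x)\pi_\ell} \right]^2 \right\}. \tag{3-51}$$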


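By Jensen's inequality [7], and the fact that $\sum_{i=1}^{M} \sum_{m \in T_i(\alpha)} = \sum_{m=1}^{M} \sum_{i \in T_m(\alpha)}$, we have that

$$\sum_{i=1}^{M} \pi_i \left[ \int_S \sum_{m \in T_i(\alpha)} \epsilon_m(x) \frac{f_m(x)\pi_m f_i(x)\,dx}{\sum_{\ell \in T_m(\alpha)} f_\ell(x)\pi_\ell} \right]^2 \ge \left[ \sum_{i=1}^{M} \pi_i \int_S \sum_{m \in T_i(\alpha)} \epsilon_m(x) \frac{f_m(x)\pi_m f_i(x)\,dx}{\sum_{\ell \in T_m(\alpha)} f_\ell(x)\pi_\ell} \right]^2 \tag{3-52}$$

$$= \left[ \sum_{m=1}^{M} \int_S \epsilon_m(x) f_m(x)\pi_m \frac{\sum_{i \in T_m(\alpha)} f_i(x)\pi_i}{\sum_{\ell \in T_m(\alpha)} f_\ell(x)\pi_\ell}\,dx \right]^2 = R^2.$$

Thus,

$$\mathrm{VAR}\{\widehat{SR}(\alpha)\} \le \frac{1}{N} \left\{ \sum_{i=1}^{M} \pi_i \int_S \left( \sum_{m \in T_i(\alpha)} \epsilon_m(x) \frac{f_m(x)\pi_m}{\sum_{\ell \in T_m(\alpha)} f_\ell(x)\pi_\ell} \right)^2 f_i(x)\,dx - R^2 \right\} = \mathrm{VAR}\{\hat{R}(\alpha)\}. \tag{3-53}$$

An optimal choice [39, 31] of the number of samples from each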


class is to choose $n_i$, $i = 1, 2, \ldots, M$ to minimize the variance of the estimator $\widehat{SR}(\alpha)$. Let

$$\sigma_i^2(\alpha) = \int_S \left( \sum_{m \in T_i(\alpha)} \epsilon_m(x) \frac{f_m(x)\pi_m}{\sum_{\ell \in T_m(\alpha)} f_\ell(x)\pi_\ell} \right)^2 f_i(x)\,dx - \left[ \int_S \sum_{m \in T_i(\alpha)} \epsilon_m(x) \frac{f_m(x)\pi_m f_i(x)\,dx}{\sum_{\ell \in T_m(\alpha)} f_\ell(x)\pi_\ell} \right]^2. \tag{3-54}$$

Then

$$\mathrm{VAR}\{\widehat{SR}(\alpha)\} = \sum_{i=1}^{M} \pi_i^2 \frac{\sigma_i^2(\alpha)}{n_i}. \tag{3-55}$$


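For a given $\alpha$, the optimal choice of $n_i$, $i = 1, 2, \ldots, M$ is found by solving the constrained minimization problem (N1):

$$\text{(N1)} \qquad \text{minimize } \sum_{i=1}^{M} \pi_i^2 \frac{\sigma_i^2(\alpha)}{n_i} \quad \text{subject to } \sum_{i=1}^{M} n_i = N.$$

It is shown in [39, 31] that the solution to (N1) is

$$n_i^* = N \frac{\pi_i \sigma_i(\alpha)}{\sum_{\ell=1}^{M} \pi_\ell \sigma_\ell(\alpha)}, \quad i = 1, 2, \ldots, M. \tag{3-56}$$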

Thus the optimal choice of the sample sizes $n_i = n_i^*$, $i = 1, 2, \ldots, M$ agrees with the heuristic choice $n_i = N\pi_i$, $i = 1, 2, \ldots, M$ only when $\sigma_i(\alpha) = \sigma(\alpha)$ $\forall i = 1, 2, \ldots, M$. Of course, the problem with using the optimal choice is that knowledge of $\sigma_i(\alpha)$, $i = 1, 2, \ldots, M$ is assumed, and if we knew this we would probably know the true risk $R$. Since choosing $n_i$, $i = 1, 2, \ldots, M$ proportional to the prior probability of class $i$ causes the stratified estimator $\widehat{SR}(\alpha)$ to have variance no larger than that of the unrestricted estimator $\hat{R}(\alpha)$ anyway, we will assume in the sequel this heuristic choice of sample sizes.
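The relation between the allocation (3-56) and the proportional choice $n_i = N\pi_i$ can be checked numerically. The sketch below is illustrative only: the function names are hypothetical, the values of $\sigma_i(\alpha)$ are made up for the example, and the sample sizes are treated as real numbers rather than rounded to integers.

```python
def proportional_allocation(N, priors):
    """Heuristic choice n_i = N * pi_i."""
    return [N * p for p in priors]


def optimal_allocation(N, priors, sigmas):
    """Allocation (3-56): n_i proportional to pi_i * sigma_i(alpha)."""
    weights = [p * s for p, s in zip(priors, sigmas)]
    total = sum(weights)
    return [N * w / total for w in weights]


def stratified_variance(priors, sigmas, n):
    """Variance (3-55): sum_i pi_i^2 * sigma_i^2 / n_i."""
    return sum(p * p * s * s / ni for p, s, ni in zip(priors, sigmas, n))


# Priors of example 2; the sigma values are hypothetical.
priors = [0.1, 0.3, 0.2, 0.19, 0.21]
sigmas = [0.20, 0.25, 0.05, 0.30, 0.28]
N = 1000
n_prop = proportional_allocation(N, priors)
n_opt = optimal_allocation(N, priors, sigmas)
print(stratified_variance(priors, sigmas, n_prop))
print(stratified_variance(priors, sigmas, n_opt))
```

When all $\sigma_i(\alpha)$ are equal the two allocations coincide; otherwise the allocation (3-56) gives the smaller value of (3-55).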

As in unrestricted sampling, examples indicate that the variance of $\widehat{SR}(\alpha)$ is non-decreasing in $\alpha$. Moore, Whitsitt and Landgrebe [30] give a 2-class example where the stratified posterior estimator has smaller variance than the stratified error count estimator. However, for their example, the error count estimator would not be included in the family $\{\widehat{SR}(\alpha) : 0 \le \alpha < \alpha_{\max}\}$.

In fact, it is not clear (as it was with unrestricted sampling) that the variance of the estimator $\widehat{SR}_m(\alpha)$ of the conditional risk $R_m$ is non-decreasing in $\alpha$. For in this case


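$$\mathrm{VAR}\{\widehat{SR}_m(\alpha)\} = \frac{1}{N} \left\{ \int_S \epsilon_m(x) \frac{f_m^2(x)}{\sum_{\ell \in T_m(\alpha)} f_\ell(x)\pi_\ell}\,dx - \sum_{i \in T_m(\alpha)} \pi_i \left[ \int_S \epsilon_m(x) \frac{f_m(x) f_i(x)\,dx}{\sum_{\ell \in T_m(\alpha)} f_\ell(x)\pi_\ell} \right]^2 \right\}. \tag{3-57}$$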

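For completeness, the covariance of the conditional stratified estimators $\widehat{SR}_m(\alpha)$ and $\widehat{SR}_\ell(\alpha)$ is

$$\mathrm{COV}\{\widehat{SR}_m(\alpha), \widehat{SR}_\ell(\alpha)\} = \frac{1}{N} \sum_{i \in T_m(\alpha) \cap T_\ell(\alpha)} \pi_i \left[ \int_S \epsilon_m(x)\epsilon_\ell(x) \frac{f_m(x) f_\ell(x) f_i(x)\,dx}{\sum_{q \in T_m(\alpha)} f_q(x)\pi_q \sum_{r \in T_\ell(\alpha)} f_r(x)\pi_r} - \int_S \epsilon_m(x) \frac{f_m(x) f_i(x)\,dx}{\sum_{q \in T_m(\alpha)} f_q(x)\pi_q} \int_S \epsilon_\ell(x) \frac{f_\ell(x) f_i(x)\,dx}{\sum_{r \in T_\ell(\alpha)} f_r(x)\pi_r} \right] \tag{3-58}$$

which is zero when $T_m(\alpha) \cap T_\ell(\alpha) = \emptyset$.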

3.3.3  Computational Requirements for Estimators in The Family

As for unrestricted sampling, a computational form for the stratified estimator $c\widehat{SR}(\alpha)$ is defined
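$$c\widehat{SR}(\alpha) = \sum_{i=1}^{M} \frac{\pi_i}{n_i} \sum_{j=1}^{n_i} \sum_{m \in T_i(\alpha)} \Big( I_i(X_{ij}, \alpha)\, \epsilon_m(X_{ij}, Q_i(\alpha)) + \big(1 - I_i(X_{ij}, \alpha)\big)\, \epsilon_m(X_{ij}) \Big) \frac{f_m(X_{ij})\pi_m}{\sum_{\ell \in T_m(\alpha)} f_\ell(X_{ij})\pi_\ell} \tag{3-59}$$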


where

$$I_i(X_{ij}, \alpha) = \begin{cases} 1 & f_i(X_{ij})\pi_i > \alpha, \\ 0 & f_i(X_{ij})\pi_i \le \alpha. \end{cases}$$

The expected number of density evaluations per sample required in $c\widehat{SR}(\alpha)$ is $SC(\alpha)$, where

$$SC(\alpha) = \frac{1}{N} \sum_{i=1}^{M} \frac{n_i}{\pi_i} \Big( U_i(\alpha)\big(M - |Q_i(\alpha)|\big) + \pi_i |Q_i(\alpha)| \Big). \tag{3-60}$$

With the heuristic choice $n_i = N\pi_i$, $i = 1, 2, \ldots, M$, we have that

$$SC(\alpha) = \sum_{i=1}^{M} U_i(\alpha)\big(M - |Q_i(\alpha)|\big) + \pi_i |Q_i(\alpha)|. \tag{3-61}$$

Thus $SC(\alpha)$ is equal to $C(\alpha)$ (see (3-36)), so that the expected number of density evaluations required by the stratified estimator $c\widehat{SR}(\alpha)$ is the same as the number required by the unrestricted estimator $c\hat{R}(\alpha)$.
CHAPTER 4

OPTIMAL ESTIMATORS


4.1 Introduction

Two families of unbiased, consistent estimators for the Bayes risk have been proposed: $\{\hat{R}(\alpha) : 0 \le \alpha < \alpha_{\max}\}$ for unrestricted sampling and $\{\widehat{SR}(\alpha) : 0 \le \alpha < \alpha_{\max}\}$ for stratified sampling. Given a sampling technique, the problem now is to choose an estimator in that family which is optimal for our purpose. We will restrict attention to the family $\{\hat{R}(\alpha) : 0 \le \alpha < \alpha_{\max}\}$. Extension to stratified sampling is obvious.

There are two major considerations in the optimality of an estimator. One is its accuracy, by which is meant some measure of the concentration of the estimator about the true risk $R$ [34]. We take as the accuracy of the estimator $\hat{R}(\alpha)$ its variance $\mathrm{VAR}\{\hat{R}(\alpha)\} = V(\alpha)/N$, where $V(\alpha)$ is the coefficient of variance defined in section 3.2.5 and $N$ is the sample size. Thus the smaller the coefficient of variance, or the greater the sample size, the greater the accuracy. By the Central Limit Theorem [22], each estimator $\hat{R}(\alpha)$, $0 \le \alpha < \alpha_{\max}$ is asymptotically normal with mean $R$ and variance $V(\alpha)/N$. Thus, at least asymptotically, all information about the accuracy of an estimator in the family is contained in its variance.

The other consideration in the optimality of an estimator is the amount of computation it requires. In many problems [39, 1, 23], point evaluations of the conditional densities used in risk estimators are costly. Thus the amount of computation required by an estimator $\hat{R}(\alpha)$ is taken as the expected number of density evaluations $N \times C(\alpha)$ necessary to obtain the estimate, where $C(\alpha)$ is defined in section 3.2.4 as the expected number of density evaluations per sample and $N$ is the number of samples.

The estimator in the family $\{\hat{R}(\alpha) : 0 \le \alpha < \alpha_{\max}\}$ with the smallest coefficient of variance $V(\alpha)$ has the property that it requires the least number of samples to achieve a given accuracy. However, when density evaluations are costly, the size of the sample is not sufficient to characterize the amount of computation required by an estimator $\hat{R}(\alpha)$, since the average number of density evaluations $C(\alpha)$ it requires per sample is also a factor. Thus rather than the optimality criterion of minimum variance we choose the criterion of maximum computational efficiency $CE(\alpha)$. The estimator $\hat{R}(\alpha^*)$ with maximum computational efficiency has the property that it requires the least amount of computation to achieve a given accuracy [16, 17].

Because of the behavior of the computational efficiency $CE(\alpha)$ as a function of $\alpha$, maximization may be carried out over a finite number of points. An algorithm to determine $\alpha^*$ to maximize the computational efficiency $CE(\alpha)$ is presented.

The optimal estimator $\hat{R}(\alpha^*)$ is compared with the existing error count and posterior estimators. It is shown that the more accurate the estimate of the risk, the greater the computational savings will be by using the optimal estimator.

Since in practice it is not possible to maximize the computational efficiency analytically, a technique whereby $n$ of the total $N$ samples are used to approximate the optimal estimator is presented. The $n$ samples should contain enough information on the gross properties of the densities, such as the closeness of various classes, to closely approximate the optimal estimator. The remaining $N - n$ samples are used to obtain an accurate estimate of the risk with minimum computation.

4.2 Computational Efficiency: A Criterion for the Optimal Estimator

The computational efficiency $CE(\alpha)$ of an estimator $\hat{R}(\alpha)$ is defined as the inverse of the product of the amount of computation it requires and its accuracy. Since the accuracy of the estimator $\hat{R}(\alpha)$ is taken as its variance $V(\alpha)/N$, and the computational requirements as $N \times C(\alpha)$, its average number of density evaluations, we have

Definition 4-1

$$CE(\alpha) = \frac{1}{V(\alpha) \times C(\alpha)}.$$

The optimal estimator in the family $\{\hat{R}(\alpha) : 0 \le \alpha < \alpha_{\max}\}$ is defined as that estimator $\hat{R}(\alpha^*)$ with maximum computational efficiency. Thus

Definition 4-2

The optimal estimator in the family $\{\hat{R}(\alpha) : 0 \le \alpha < \alpha_{\max}\}$ is $\hat{R}(\alpha^*)$, where $\alpha^*$ is such that

$$\max_{0 \le \alpha < \alpha_{\max}} CE(\alpha) = CE(\alpha^*). \tag{4-1}$$

The optimal estimator $\hat{R}(\alpha^*)$ has the property that it achieves any given accuracy with a minimum of computation. More precisely, let the sample size $N_a^*$ be chosen such that the estimator $\hat{R}(\alpha^*)$ based on $N_a^*$ samples obtains accuracy $a$. That is, let $N_a^*$ be such that

$$\frac{V(\alpha^*)}{N_a^*} = a. \tag{4-2}$$

Let $\hat{R}(\alpha)$ be any other estimator in the family with sample size $N$ chosen to obtain the same accuracy $a$. Thus $N$ is such that

$$\frac{V(\alpha)}{N} = a. \tag{4-3}$$

Then the amount of computation required by $\hat{R}(\alpha^*)$ based on $N_a^*$ samples is no greater than that required by $\hat{R}(\alpha)$ on $N$ samples, i.e.

$$N_a^* \times C(\alpha^*) \le N \times C(\alpha). \tag{4-4}$$

The above is merely the statement that $\alpha^*$, $N_a^* = V(\alpha^*)/a$ solve the constrained minimization problem (M1)

$$\text{(M1)} \qquad \underset{\substack{0 \le N \\ 0 \le \alpha < \alpha_{\max}}}{\text{minimize}}\; N \times C(\alpha) \quad \text{subject to } \frac{V(\alpha)}{N} = a.$$
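Thus we have

Theorem 4-1

Let $\alpha^*$ be such that $CE(\alpha^*) = \max_{0 \le \alpha < \alpha_{\max}} CE(\alpha)$ and let $N_a^* = V(\alpha^*)/a$. Then $\alpha^*$, $N_a^*$ solve the constrained minimization problem (M1).

Proof:

By definition of $\alpha^*$

$$CE(\alpha^*) \ge CE(\alpha) \quad \forall\; 0 \le \alpha < \alpha_{\max}. \tag{4-5}$$

By the definition 4-1 of $CE(\alpha)$ we have

$$V(\alpha^*) \times C(\alpha^*) \le V(\alpha) \times C(\alpha) \quad \forall\; 0 \le \alpha < \alpha_{\max}. \tag{4-6}$$

By definition of $N_a^*$, we may write $V(\alpha^*) = a N_a^*$, thus from (4-6)

$$a N_a^* \times C(\alpha^*) \le V(\alpha) \times C(\alpha) \quad \forall\; 0 \le \alpha < \alpha_{\max}. \tag{4-7}$$

If $N$, $\alpha$ are such that $V(\alpha)/N = a$ then $V(\alpha) = aN$. Thus from (4-7)

$$N_a^* \times C(\alpha^*) \le N \times C(\alpha) \quad \forall\; \alpha, N \ni \frac{V(\alpha)}{N} = a. \tag{4-8}$$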
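The content of theorem 4-1 can be illustrated on a finite table of candidate values. In the sketch below the triples $(\alpha, V(\alpha), C(\alpha))$ are hypothetical numbers chosen for illustration and the function names are our own; the point is only that the $\alpha$ maximizing $CE(\alpha)$ is also the one minimizing $N \times C(\alpha)$ under the constraint $V(\alpha)/N = a$.

```python
def computational_efficiency(V, C):
    """Definition 4-1: CE = 1 / (V * C)."""
    return 1.0 / (V * C)


def cheapest_for_accuracy(candidates, a):
    """Solve (M1) over a finite candidate list.

    candidates: list of (alpha, V, C). For each alpha the sample size is
    fixed by the constraint V / N = a, and the cost is N * C.
    Returns (alpha, N, cost) with minimum cost."""
    best = None
    for alpha, V, C in candidates:
        N = V / a
        cost = N * C
        if best is None or cost < best[2]:
            best = (alpha, N, cost)
    return best


# Hypothetical (alpha, V(alpha), C(alpha)) triples.
candidates = [(0.0, 0.010, 5.0), (0.01, 0.012, 3.0), (0.05, 0.020, 1.94)]
alpha_star = max(candidates, key=lambda t: computational_efficiency(t[1], t[2]))[0]
best = cheapest_for_accuracy(candidates, a=1e-4)
print(alpha_star, best)
```

Since the cost under the constraint is $V(\alpha) \times C(\alpha) / a$, minimizing it for any fixed $a$ is the same as maximizing $CE(\alpha)$, which is the argument of the proof above.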


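By symmetry, the optimal estimator $\hat{R}(\alpha^*)$ also has the property that for a given amount of computation $b$, $\hat{R}(\alpha^*)$ achieves the greatest accuracy. Thus $\alpha^*$ and $N_b^* = b / C(\alpha^*)$ solve the constrained minimization problem (M2)

$$\text{(M2)} \qquad \underset{\substack{0 \le N \\ 0 \le \alpha < \alpha_{\max}}}{\text{minimize}}\; \frac{V(\alpha)}{N} \quad \text{subject to } N \times C(\alpha) = b.$$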

4.3 An Algorithm for Maximization of The Computational Efficiency

The optimal estimator $\hat{R}(\alpha^*)$ from the family $\{\hat{R}(\alpha) : 0 \le \alpha < \alpha_{\max}\}$ is determined by finding $\alpha^*$ to maximize the computational efficiency $CE(\alpha)$. Because of the behavior of the coefficient of variance $V(\alpha)$ and the expected number of density evaluations per sample $C(\alpha)$, it is only necessary to consider those values of $\alpha$ that induce changes in the sets $T_m(\alpha)$ for some $m = 1, 2, \ldots, M$, namely $\alpha_{t_0}, \alpha_{t_1}, \ldots, \alpha_{t_K}$ of definition 3-4, to determine $\alpha^*$. This result is proved in the following theorem.

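Theorem 4-2

$$\max_{0 \le \alpha < \alpha_{\max}} CE(\alpha) = \max_{i = 0, 1, \ldots, K} CE(\alpha_{t_i})$$

Proof:

From section 3.2.5, $V(\alpha)$ is constant $\forall \alpha \ni \alpha_{t_i} \le \alpha < \alpha_{t_{i+1}}$.

From section 3.2.4, $C(\alpha)$ is non-decreasing $\forall \alpha \ni \alpha_{t_i} \le \alpha < \alpha_{t_{i+1}}$.

Thus

$$CE(\alpha) = \frac{1}{V(\alpha) \times C(\alpha)}$$

is non-increasing $\forall \alpha \ni \alpha_{t_i} \le \alpha < \alpha_{t_{i+1}}$.

Therefore,

$$\max_{\alpha_{t_i} \le \alpha < \alpha_{t_{i+1}}} CE(\alpha) = CE(\alpha_{t_i}).$$

Finally

$$\max_{0 \le \alpha < \alpha_{\max}} CE(\alpha) = \max_{i = 0, 1, \ldots, K}\; \max_{\alpha_{t_i} \le \alpha < \alpha_{t_{i+1}}} CE(\alpha) \tag{4-9}$$

$$= \max_{i = 0, 1, \ldots, K} CE(\alpha_{t_i}). \tag{4-10}$$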

Since it has been observed that the coefficient of variance $V(\alpha)$ is non-decreasing in $\alpha$, we state the following corollary. In this case, $CE(\alpha)$ may be maximized over $\alpha_{q_0}, \alpha_{q_1}, \ldots, \alpha_{q_J}$ of definition 3-7, the subset of $\alpha_{t_0}, \alpha_{t_1}, \ldots, \alpha_{t_K}$ which induce changes in the sets $Q_m(\alpha)$ for some $m = 1, 2, \ldots, M$. The convenience is that in general $J < K$, thus maximization may be carried out over fewer points.

Corollary 4-2

If $V(\alpha)$ is non-decreasing in $\alpha$ then

$$\max_{0 \le \alpha < \alpha_{\max}} CE(\alpha) = \max_{i = 0, 1, \ldots, J} CE(\alpha_{q_i}).$$

Proof:

Follows since $C(\alpha)$ and $V(\alpha)$ are non-decreasing on $\alpha_{q_i} \le \alpha < \alpha_{q_{i+1}}$.

Thus the problem of maximizing the computational efficiency $CE(\alpha)$ is reduced to finding the points $\alpha_{t_0}, \alpha_{t_1}, \ldots, \alpha_{t_K}$ and evaluating $CE(\alpha)$ at these points. We now describe a convenient method to find $\alpha_{t_0}, \alpha_{t_1}, \ldots, \alpha_{t_K}$ and the resulting sets $T_m(\alpha_{t_i})$, $m = 1, 2, \ldots, M$, $i = 0, 1, \ldots, K$.

Let $d_{rs}$ be the value of $\alpha$ which splits classes $r$ and $s$, defined as follows.


Definition 4-3

$$d_{rs} = \max_{x \in S} \min\{f_r(x)\pi_r,\; f_s(x)\pi_s\}, \quad r = 1, 2, \ldots, M,\; s = r, r+1, \ldots, M.$$

Note that for $\alpha < d_{rs}$, $r \in T_s(\alpha)$ and $s \in T_r(\alpha)$. This follows since if $\alpha < \max_{x \in S} \min\{f_r(x)\pi_r, f_s(x)\pi_s\}$ then $\exists\, x \in S$ such that $f_r(x)\pi_r > \alpha$ and $f_s(x)\pi_s > \alpha$. For $\alpha \ge d_{rs}$, $r \notin T_s(\alpha)$ and $s \notin T_r(\alpha)$. Thus $d_{rs}$ is the smallest value of $\alpha$ that splits classes $r$ and $s$, in the sense that $\forall \alpha \ge d_{rs}$, $r \notin T_s(\alpha)$ and $s \notin T_r(\alpha)$. Figure 4-1 shows three joint densities and the values of $d_{12}$, $d_{13}$ and $d_{23}$.

For simplification, assume that the points $d_{rs}$, $r = 1, 2, \ldots, M-1$, $s = r+1, \ldots, M$ are distinct. Then the values $\alpha_{t_0}, \alpha_{t_1}, \ldots, \alpha_{t_K}$ and $\alpha_{\max}$ may be obtained from $d_{rs}$, $r = 1, 2, \ldots, M$, $s = r, r+1, \ldots, M$ as follows.

Definition 4-4

Order the values of $d_{rs}$ in increasing order as follows

    $d_{r_0 s_0} = 0$
    Do $i = 0$ by 1
        $d_{r_{i+1} s_{i+1}} = \min_{r,s \,\ni\, d_{rs} > d_{r_i s_i}} d_{rs}$
        Stop if $r_{i+1} = s_{i+1}$.
    End.
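The ordering of definition 4-4 amounts to sorting the positive values $d_{rs}$ and stopping at the first diagonal entry. A minimal sketch follows, with a hypothetical function name and a made-up three-class matrix of $d_{rs}$ values; classes are labelled from 0, only the upper triangle of the matrix is read, and the values are assumed distinct as in the text.

```python
def alpha_breakpoints(d):
    """Order the values d[r][s] (r <= s) as in definition 4-4.

    Returns (breakpoints, alpha_max), where breakpoints is the list
    [(0.0, None, None), (d_{r1 s1}, r1, s1), ..., (d_{rK sK}, rK, sK)]
    and alpha_max is the first diagonal value reached."""
    M = len(d)
    ordered = sorted((d[r][s], r, s)
                     for r in range(M) for s in range(r, M)
                     if d[r][s] > 0.0)
    breakpoints = [(0.0, None, None)]      # d_{r0 s0} = 0 by convention
    for value, r, s in ordered:
        if r == s:                         # stop rule: r_{i+1} = s_{i+1}
            return breakpoints, value
        breakpoints.append((value, r, s))
    raise ValueError("no positive diagonal entry d[r][r] found")


# Hypothetical values of d_rs for three classes (upper triangle only).
d = [[0.04, 0.03, 0.001],
     [0.00, 0.12, 0.020],
     [0.00, 0.00, 0.080]]
breakpoints, alpha_max = alpha_breakpoints(d)
print(breakpoints, alpha_max)
```

Here $K = 3$: classes 0 and 2 split first, then 1 and 2, then 0 and 1, and the smallest diagonal value 0.04 plays the role of $\alpha_{\max}$.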
Theorem 4-3

$$\alpha_{t_i} = d_{r_i s_i}, \quad i = 0, 1, \ldots, K$$

$$\alpha_{\max} = d_{r_{K+1} s_{K+1}}$$

Proof:

By induction.

$\alpha_{t_0} = d_{r_0 s_0} = 0$ by definition.

Suppose $\alpha_{t_i} = d_{r_i s_i}$, $i = 1, 2, \ldots, k < K$.

By definition, $d_{r_{k+1} s_{k+1}}$ is the smallest value of $\alpha > d_{r_k s_k}$ such that $s_{k+1} \notin T_{r_{k+1}}(\alpha)$ and $r_{k+1} \notin T_{s_{k+1}}(\alpha)$. Since $d_{r_k s_k} = \alpha_{t_k}$ by the induction hypothesis, $d_{r_{k+1} s_{k+1}}$ is the smallest value of $\alpha > \alpha_{t_k}$ such that $T_m(\alpha) \ne T_m(\alpha_{t_k})$ for some $m$ (namely $m = r_{k+1}, s_{k+1}$). Thus $d_{r_{k+1} s_{k+1}} = \alpha_{t_{k+1}}$.

Also, by definitions 3-3 and 4-3,

$$\alpha_{\max} = \min_{1 \le \ell \le M} \max_{x \in S} f_\ell(x)\pi_\ell = \min_{1 \le \ell \le M} d_{\ell\ell} = d_{r_{K+1} s_{K+1}}.$$

The sets $T_m(\alpha_{t_i})$, $m = 1, 2, \ldots, M$, $i = 0, 1, \ldots, K$ may also be determined from the ordered values $d_{rs}$ by the following corollary.
Corollary 4-3

$T_m(\alpha_{t_0}) = \{i \mid \exists x \ni f_i(x)\pi_i > 0 \text{ and } f_m(x)\pi_m > 0\}$

Do $i = 0$ to $K-1$.

    $T_m(\alpha_{t_{i+1}}) = T_m(\alpha_{t_i}) \quad \forall\, m \ne r_{i+1},\; m \ne s_{i+1}$

    $T_{r_{i+1}}(\alpha_{t_{i+1}}) = T_{r_{i+1}}(\alpha_{t_i}) - s_{i+1}$  (delete $s_{i+1}$ from $T_{r_{i+1}}(\alpha_{t_i})$)

    $T_{s_{i+1}}(\alpha_{t_{i+1}}) = T_{s_{i+1}}(\alpha_{t_i}) - r_{i+1}$  (delete $r_{i+1}$ from $T_{s_{i+1}}(\alpha_{t_i})$).

Figure 4-1 shows the values $d_{12}$, $d_{13}$ and $d_{23}$ and their correspondence to $\alpha_{t_0}, \ldots, \alpha_{t_K}$, $\alpha_{\max}$ and the sets $T_m(\alpha_{t_i})$ for each class $m$ and $i = 0, 1, \ldots, K$.

Assuming that the values $d_{rs}$, $r = 1, 2, \ldots, M$, $s = r, r+1, \ldots, M$ have been computed from definition 4-3 and placed in increasing order in correspondence with $\alpha_{t_0}, \alpha_{t_1}, \ldots, \alpha_{t_K}$ as in definition 4-4 and theorem 4-3, an algorithm to determine $\alpha^*$ to maximize the computational efficiency $CE(\alpha)$ is as follows.

Algorithm A

Let $\alpha_{t_0} = 0$.

1)  For $m = 1, 2, \ldots, M$ let
    $T_m(\alpha_{t_0}) = \{\ell \mid \exists x \ni f_m(x)\pi_m > \alpha_{t_0},\; f_\ell(x)\pi_\ell > \alpha_{t_0}\}$

2)  For $m = 1, 2, \ldots, M$ let
    $Q_m(\alpha_{t_0}) = \bigcup_{q \in T_m(\alpha_{t_0})} T_q(\alpha_{t_0})$.

3)  Compute $CE(\alpha_{t_0}) = \dfrac{1}{C(\alpha_{t_0}) \times V(\alpha_{t_0})}$.

4)  Max $= CE(\alpha_{t_0})$, $\alpha^* = \alpha_{t_0}$.

Do for $i = 1$ to $K$.

5)  For $m = 1, 2, \ldots, M$ let
    $T_m(\alpha_{t_i}) = T_m(\alpha_{t_{i-1}}) \quad \forall\, m \ne r_i, s_i$
    $T_{r_i}(\alpha_{t_i}) = T_{r_i}(\alpha_{t_{i-1}}) - s_i$
    $T_{s_i}(\alpha_{t_i}) = T_{s_i}(\alpha_{t_{i-1}}) - r_i$.

6)  For $m = 1, 2, \ldots, M$ let
    $Q_m(\alpha_{t_i}) = \bigcup_{q \in T_m(\alpha_{t_i})} T_q(\alpha_{t_i})$.

7)  Compute $CE(\alpha_{t_i}) = \dfrac{1}{C(\alpha_{t_i}) \times V(\alpha_{t_i})}$.

8)  If $CE(\alpha_{t_i}) >$ Max then
    Max $= CE(\alpha_{t_i})$, $\alpha^* = \alpha_{t_i}$.

Note that if $V(\alpha)$ is non-decreasing in $\alpha$, by corollary 4-2, steps 7) and 8) need only be done for those $i$ such that $\exists\, m$ such that $Q_m(\alpha_{t_i}) \ne Q_m(\alpha_{t_{i-1}})$. Thus $CE(\alpha)$ need only be evaluated at those values $\alpha_{q_0}, \alpha_{q_1}, \ldots, \alpha_{q_J}$ which induce changes in the sets $Q_m(\alpha)$, $m = 1, 2, \ldots, M$.
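The steps of Algorithm A can be sketched as follows. This is an illustration under stated assumptions rather than a definitive implementation: the computational efficiency is supplied by the caller as a function `efficiency(alpha, T, Q)`, since $V(\alpha)$ and $C(\alpha)$ depend on the densities; classes are labelled from 0; and the breakpoints, priors and values of $V$ in the example are hypothetical (the cost model takes $U_i(\alpha) = 0$ for simplicity).

```python
def algorithm_a(M, breakpoints, T0, efficiency):
    """Sketch of Algorithm A.

    breakpoints: [(alpha_t0, None, None), (alpha_t1, r1, s1), ...], i.e. the
                 ordered split values with the pair of classes split at each.
    T0:          T0[m] is the set T_m(alpha_t0).
    efficiency:  caller-supplied function (alpha, T, Q) -> CE(alpha).
    Returns (alpha_star, max_CE, sets T at alpha_star)."""
    T = [set(t) for t in T0]
    best_alpha, best_ce, best_T = None, float("-inf"), None
    for alpha, r, s in breakpoints:
        if r is not None:                      # step 5: split classes r and s
            T[r].discard(s)
            T[s].discard(r)
        # steps 2 and 6: Q_m is the union of T_q over q in T_m
        Q = [set().union(*(T[q] for q in T[m])) for m in range(M)]
        ce = efficiency(alpha, T, Q)           # steps 3 and 7
        if ce > best_ce:                       # steps 4 and 8
            best_alpha, best_ce, best_T = alpha, ce, [set(t) for t in T]
    return best_alpha, best_ce, best_T


# Hypothetical three-class illustration.
priors = [0.3, 0.3, 0.4]
V_table = {0.0: 0.010, 0.001: 0.011, 0.02: 0.012, 0.03: 0.030}


def efficiency(alpha, T, Q):
    C = sum(p * len(q) for p, q in zip(priors, Q))   # (3-61) with U_i = 0
    return 1.0 / (V_table[alpha] * C)


breakpoints = [(0.0, None, None), (0.001, 0, 2), (0.02, 1, 2), (0.03, 0, 1)]
alpha_star, max_ce, T_star = algorithm_a(3, breakpoints, [{0, 1, 2}] * 3, efficiency)
print(alpha_star, max_ce, T_star)
```

In this example the first split leaves every $Q_m$ unchanged, so under corollary 4-2 the evaluation at that point could have been skipped.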

4.4 Comparison of the Optimal Estimator With the Error Count and Posterior Estimators

We first compare the optimal estimator $\hat{R}(\alpha^*)$ to the posterior estimator $\hat{R}(p)$, defined in section 3.2.1, on the basis of relative computational efficiency. The posterior estimator requires for each sample $X_j$, $j = 1, 2, \ldots, N$, the evaluation of all $M$ conditional densities $f_\ell(X_j)$, $\ell = 1, 2, \ldots, M$, a total of $N \times M$ density evaluations. The variance of the posterior estimator is, from (3-9)

$$\mathrm{VAR}\{\hat{R}(p)\} = \frac{\mathrm{VAR}\{r(X)\}}{N}. \tag{4-11}$$

Let the coefficient of variance $V(p)$ be

$$V(p) = \mathrm{VAR}\{r(X)\}. \tag{4-12}$$

Then the computational efficiency $CE(p)$ of the posterior estimator is

$$CE(p) = \frac{1}{M \times V(p)}. \tag{4-13}$$

The computational efficiency of the optimal estimator relative to the posterior estimator, $RCE(\alpha^*, p)$, is defined by

$$RCE(\alpha^*, p) = \frac{CE(\alpha^*)}{CE(p)} = \frac{M \times V(p)}{C(\alpha^*) \times V(\alpha^*)}. \tag{4-14}$$

The following theorem states that the computational efficiency of the optimal estimator is greater than or equal to that of the posterior estimator.

Theorem 4-4

$$RCE(\alpha^*, p) \ge 1$$

Proof:

The estimator $\hat{R}(0)$ in the family $\{\hat{R}(\alpha) : 0 \le \alpha < \alpha_{\max}\}$ is equivalent to $\hat{R}(p)$ in the sense that

$$\hat{R}(0) = \hat{R}(p). \tag{4-15}$$

Thus $\hat{R}(0)$ and $\hat{R}(p)$ have the same coefficient of variance

$$V(0) = V(p). \tag{4-16}$$

However, it may be the case (if the conditional densities have finite support) that $\hat{R}(0)$ may be computed with fewer than $M$ conditional density evaluations per sample. Thus

$$C(0) \le M. \tag{4-17}$$

From (4-16) and (4-17), with $C(p) = M$, we have

$$V(0) \times C(0) \le V(p) \times C(p) \tag{4-18}$$

and by definition of computational efficiency,

$$CE(0) \ge CE(p). \tag{4-19}$$

By definition 4-2 of $\alpha^*$

$$CE(\alpha^*) \ge CE(\alpha) \quad \forall\; 0 \le \alpha < \alpha_{\max}. \tag{4-20}$$

Thus

$$CE(\alpha^*) \ge CE(p), \tag{4-21}$$

and finally

$$RCE(\alpha^*, p) \ge 1. \tag{4-22}$$


The computational efficiency of the optimal estimator relative to the posterior, $RCE(\alpha^*, p)$, has the interpretation that if the sample size $N_p$ for the posterior estimator and $N_*$ for the optimal estimator are chosen so that both estimators have the same accuracy (variance), then the posterior estimator will require $RCE(\alpha^*, p)$ times the number of density evaluations required by the optimal estimator.

To see this, let the sample sizes $N_*$ and $N_p$ be chosen so that

$$\frac{V(\alpha^*)}{N_*} = \frac{V(p)}{N_p}. \tag{4-23}$$

The amount of computation, expressed as the average number of density evaluations, required by the optimal estimator is $N_* \times C(\alpha^*)$. The number of density evaluations required by the posterior to achieve the same accuracy is $N_p \times M$. From (4-23), we have that

$$N_p \times M = \frac{V(p) \times N_*}{V(\alpha^*)} \times M = \frac{V(p) \times M}{V(\alpha^*) \times C(\alpha^*)} \big(N_* \times C(\alpha^*)\big) = \frac{CE(\alpha^*)}{CE(p)} \big(N_* \times C(\alpha^*)\big) = RCE(\alpha^*, p) \big(N_* \times C(\alpha^*)\big). \tag{4-24}$$

The posterior estimator requires $RCE(\alpha^*, p)$ times the number of density evaluations required by the optimal estimator to obtain the same accuracy. Thus by using the optimal estimator $\hat{R}(\alpha^*)$, we have saved ourselves, on the average, $S(\alpha^*, p)$ density evaluations, where

$$S(\alpha^*, p) = \big(RCE(\alpha^*, p) - 1\big)\, C(\alpha^*) \times N_*. \tag{4-25}$$

From (4-25), it is clear that the more accurate an estimate of the risk desired, that is, the larger $N_*$, the greater the savings in computation $S(\alpha^*, p)$.
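A short numerical check of (4-23) through (4-25) may help fix the interpretation. The values of $M$, $V(p)$, $V(\alpha^*)$, $C(\alpha^*)$ and $N_*$ below are hypothetical and the function names are our own.

```python
def relative_efficiency(M, V_ref, V_opt, C_opt):
    """RCE as in (4-14): (M * V_ref) / (C(alpha*) * V(alpha*)).
    V_ref is the coefficient of variance of the reference estimator,
    which evaluates all M densities per sample."""
    return (M * V_ref) / (C_opt * V_opt)


def savings(rce, C_opt, N_opt):
    """Density evaluations saved, as in (4-25)."""
    return (rce - 1.0) * C_opt * N_opt


# Hypothetical values.
M, V_p, V_opt, C_opt, N_opt = 5, 0.010, 0.012, 2.0, 1200
rce = relative_efficiency(M, V_p, V_opt, C_opt)
N_p = V_p * N_opt / V_opt          # equal accuracy, from (4-23)
print(rce, savings(rce, C_opt, N_opt), N_p * M - N_opt * C_opt)
```

With these numbers the posterior estimator needs 1000 samples and 5000 density evaluations, the optimal estimator needs 1200 samples and 2400 evaluations, and the difference agrees with (4-25).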

The optimal estimator $\hat{R}(\alpha^*)$ compares even more favorably to the error count estimator $\hat{R}(ec)$, defined in section 3.2.1. The variance of the error count estimator is, from (3-3),

$$\mathrm{VAR}\{\hat{R}(ec)\} = \frac{R(1-R)}{N}. \tag{4-26}$$

Since the error count estimator requires evaluation of the conditional density $f_\ell(X_j)$ for each sample $X_j$, $j = 1, 2, \ldots, N$ and for each class $\ell = 1, 2, \ldots, M$, the computational efficiency of the error count estimator $CE(ec)$ is given by

$$CE(ec) = \frac{1}{M \times R(1-R)}. \tag{4-27}$$

The computational efficiency of the error count estimator is less than that of the posterior. From (3-11), we have that

V(p)  ^ R(l-R)  - < R(l-R)  • (4-28) 

Thus

$$CE(ec) = \frac{1}{M \times R(1-R)} < \frac{1}{M \times V(p)} = CE(p). \tag{4-29}$$

The computational efficiency of the optimal estimator relative to the error count estimator, $RCE(\alpha^*, ec)$, is given by

$$RCE(\alpha^*, ec) = \frac{CE(\alpha^*)}{CE(ec)} = \frac{M \times R(1-R)}{C(\alpha^*) \times V(\alpha^*)}. \tag{4-30}$$

From (4-29) and theorem 4-4 we have

$$RCE(\alpha^*, ec) > RCE(\alpha^*, p) \ge 1. \tag{4-31}$$

If the error count estimator $\hat{R}(ec)$ is a member of the family $\{\hat{R}(\alpha) : 0 \le \alpha < \alpha_{\max}\}$ and if $V(\alpha)$ is non-decreasing in $\alpha$, we have

$$V(\alpha^*) \le R(1-R). \tag{4-32}$$

In this case, a lower bound on the computational efficiency of the optimal estimator relative to the error count estimator is

$$RCE(\alpha^*, ec) \ge \frac{M \times R(1-R)}{C(\alpha^*) \times R(1-R)} = \frac{M}{C(\alpha^*)}. \tag{4-33}$$

Thus if (4-32) holds, the error count estimator requires at least $M / C(\alpha^*)$ times the number of conditional density evaluations required by the optimal estimator to obtain the same accuracy.

The number of density evaluations saved by using the optimal estimator rather than the error count, $S(\alpha^*, ec)$, is

$$S(\alpha^*, ec) = \big(RCE(\alpha^*, ec) - 1\big)\, C(\alpha^*) \times N_* \ge \left( \frac{M}{C(\alpha^*)} - 1 \right) C(\alpha^*) \times N_* = \big(M - C(\alpha^*)\big) N_*. \tag{4-34}$$

Thus the more accurate an estimate of the risk desired (the larger $N_*$) the greater the savings. Also, the smaller $C(\alpha^*)$ relative to the total number $M$ of classes, the greater the savings. One would expect $M \gg C(\alpha^*)$ for a large number $M$ of classes which tend to form several small clusters.

4.5 Approximation of the Optimal Estimator

The optimal estimator $\hat{R}(\alpha^*)$ in the family $\{\hat{R}(\alpha) : 0 \le \alpha < \alpha_{\max}\}$ is determined by finding $\alpha^*$ to maximize the computational efficiency $CE(\alpha)$. However, if we had enough information to maximize the computational efficiency analytically, we could evaluate the Bayes risk $R$ analytically. We propose that a subset of the data, say $\{(X_1, \theta_1), \ldots, (X_n, \theta_n)\}$ where $n \ll N$, be used to approximate the optimal estimator. The remaining $N - n$ samples are used in the approximated optimal estimator to obtain an accurate estimate of the Bayes risk efficiently.

Recall algorithm A in section 4.3 for finding $\alpha^*$ to maximize the computational efficiency $CE(\alpha)$. In order to use this algorithm, we need to know the points $d_{rs}$, $r = 1, 2, \ldots, M$, $s = r, r+1, \ldots, M$ and the value of the computational efficiency at these points. Since in practice these values are not known, we propose they be approximated on the basis of the $n$ samples $\{(X_1, \theta_1), \ldots, (X_n, \theta_n)\}$ as follows.

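An approximation to the computational efficiency $CE(\alpha)$ at any given $\alpha$ is formed as

$$\widehat{CE}(\alpha) = \frac{1}{\hat{V}(\alpha) \times \hat{C}(\alpha)} \tag{4-35}$$

where $\hat{V}(\alpha)$ and $\hat{C}(\alpha)$ are unbiased estimates of $V(\alpha)$ and $C(\alpha)$ given by

$$\hat{C}(\alpha) = \sum_{\ell=1}^{M} |Q_\ell(\alpha)| \big(\pi_\ell - \hat{U}_\ell(\alpha)\big) + M \hat{U}_\ell(\alpha) \tag{4-36}$$

where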


l n f/(Xi)n/ 

V“>  ^ sh 


(4-37) 


. n t (X.)  f (X.)tt 

? , x 1 _ , „ m i m 1 m 

V (a ) — . .£.  t ^ y f /y  In 

mtTe .(")  la  (a)  1 J 1 

J m 


(4-38) 


- ( - S 

n k_1  m€T0  (a) 


W Wnm  . 
= A(VV 

-(■€T  (a) 


The points $d_{rs}$, $r = 1, 2, \ldots, M$, $s = r, r+1, \ldots, M$ might be approximated by

$$\hat{d}_{rs} = \max_{1 \le j \le n} \min\{f_r(X_j)\pi_r,\; f_s(X_j)\pi_s\}. \tag{4-39}$$
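The estimate (4-39) requires all $M$ densities to be evaluated at each of the $n$ pilot samples. A minimal sketch follows, assuming univariate normal class densities with unit standard deviation and the priors and means of example 2; the function names and the pilot sample size are illustrative assumptions, and classes are labelled from 0.

```python
import math
import random


def normal_pdf(x, mean, sd=1.0):
    """Density of a univariate normal distribution (illustrative assumption)."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def estimate_split_values(xs, priors, means):
    """Estimate d_rs as in (4-39) from the pilot samples xs.

    Returns an upper-triangular matrix d_hat with
    d_hat[r][s] = max_j min(f_r(X_j) pi_r, f_s(X_j) pi_s) for r <= s."""
    M = len(priors)
    d_hat = [[0.0] * M for _ in range(M)]
    for x in xs:
        joint = [p * normal_pdf(x, mu) for p, mu in zip(priors, means)]
        for r in range(M):
            for s in range(r, M):
                d_hat[r][s] = max(d_hat[r][s], min(joint[r], joint[s]))
    return d_hat


# Pilot sample of n = 50 points drawn by unrestricted sampling.
random.seed(1)
priors = [0.1, 0.3, 0.2, 0.19, 0.21]
means = [0.0, 0.5, 6.0, 10.0, 11.0]
labels = random.choices(range(5), weights=priors, k=50)
pilot = [random.gauss(means[c], 1.0) for c in labels]
d_hat = estimate_split_values(pilot, priors, means)
print(d_hat[0][1], d_hat[3][4])
```

Because the maximum is taken over a finite set of sample points rather than over all of $S$, each $\hat{d}_{rs}$ can only fall at or below the corresponding $d_{rs}$.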


Once the values $\hat{d}_{rs}$, $r = 1, 2, \ldots, M$, $s = r, r+1, \ldots, M$ have been ordered and put into correspondence with the points $\hat{\alpha}_{t_0}, \hat{\alpha}_{t_1}, \ldots, \hat{\alpha}_{t_K}$ as in theorem 4-3,


Figure 4-2. Estimates $\hat{d}_{12}$ and $\tilde{d}_{12}$ of $d_{12}$ based on one sample $(X_1, \theta_1 = 1)$. $\hat{d}_{12}$ underestimates $d_{12}$, which results in sets $\hat{T}$ that are smaller than the true sets $T$. The overestimate $\tilde{d}_{12}$ solves this problem.
algorithm A may be performed by substituting the approximated values for the true values. As a result, a value $\hat{\alpha}^*$ which maximizes the approximate computational efficiency $\widehat{CE}$ is obtained.

However, the following difficulty arises. Algorithm A determines the sets $T_m(\alpha_{t_i})$, $m = 1, 2, \ldots, M$, $i = 0, 1, \ldots, K$ as in corollary 4-3. But since

$$\hat{d}_{rs} \le d_{rs}, \quad r = 1, 2, \ldots, M,\; s = r, r+1, \ldots, M \tag{4-40}$$

the sets $\hat{T}_m(\hat{\alpha}_{t_i})$, $m = 1, 2, \ldots, M$, $i = 0, 1, \ldots, K$ which result from corollary 4-3 using the approximated values $\hat{d}_{rs}$ and $\hat{\alpha}_{t_i}$ have the property that

$$\hat{T}_m(\hat{\alpha}_{t_i}) \subset T_m(\alpha_{t_i}). \tag{4-41}$$

Thus the approximated set $\hat{T}_m(\hat{\alpha}_{t_i})$ of classes "$\hat{\alpha}_{t_i}$-close" to class $m$ is smaller than the true set $T_m(\alpha_{t_i})$ of classes "$\alpha_{t_i}$-close" to class $m$. Figure 4-2 shows this behavior for 2 classes, with $d_{12} = \alpha_{t_1}$ approximated with $n = 1$ sample $X_1$, $\theta_1 = 1$.

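The result of this is that if the estimator $c\hat R(\hat\alpha)$, in computational
form, is used with the approximated sets $\hat T_m(\hat\alpha)$, m=1,2...M, a biased
estimate of the risk results. The reason is that in its computational
form, $c\hat R(\hat\alpha)$ uses the modified error function $e_m(X, \hat Q_\theta(\hat\alpha))$ for $m \in \hat T_\theta(\hat\alpha)$
whenever $f_\theta(X)\pi_\theta > \hat\alpha$. Although by theorem 3-2 it is true that

\[ \forall\, m \in T_\theta(\hat\alpha), \quad e_m(X) = e_m(X, Q_\theta(\hat\alpha)) \quad \text{whenever } f_\theta(X)\pi_\theta > \hat\alpha, \tag{4-42} \]

it is not true in general that

\[ \forall\, m \in \hat T_\theta(\hat\alpha), \quad e_m(X) = e_m(X, \hat Q_\theta(\hat\alpha)) \quad \text{whenever } f_\theta(X)\pi_\theta > \hat\alpha, \tag{4-43} \]

since $\hat Q_\theta(\hat\alpha)$ may be smaller than $Q_\theta(\hat\alpha)$.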
a 0 

Example  4-1 

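In figure 4-2, the values $\hat\alpha_{t_1}$ and $\hat T_m(\hat\alpha_{t_1})$, m=1,2 were approximated
with the sample $(X_1, \theta_1 = 1)$. If sample $(X_2, \theta_2 = 1)$ were used in the
computational form $c\hat R(\hat\alpha_{t_1})$, since $f_1(X_2)\pi_1 > \hat\alpha_{t_1}$ the modified error function
would be used. But the modified error function takes the value 0 at $X_2$,
while the unmodified error function takes the value 1.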

Examples indicate that the sets $\hat T_m(\hat\alpha_{t_i})$, m=1,2...M, i=0,1...K
approximate well the true sets $T_m(\alpha_{t_i})$, m=1,2...M, i=0,1...K (see section
4.6). The ad-hoc method employed here is to overestimate the values
$d_{r_i s_i}$, i=0,1...K+1 by

\[ \tilde d_{r_i s_i} = f_{r_i}(X)\pi_{r_i} \quad \text{whenever } \hat d_{r_i s_i} = f_{s_i}(X)\pi_{s_i}. \tag{4-44} \]

The value $\tilde d_{12}$ is illustrated in figure 4-2. Note that in this case,

\[ \tilde T_m(\tilde\alpha_{t_1}) = T_m(\alpha_{t_1}), \qquad m=1,2. \tag{4-45} \]

Let $\hat\alpha^*$ be the approximated optimal $\alpha$ determined by Algorithm A, with
the approximated values $\widehat{CE}$ and $\tilde T_m$ (determined from $\tilde d_{r_i s_i}$) substituted
for the true values. Thus $\hat\alpha^*$ and sets $\tilde T_m(\hat\alpha^*)$, m=1,2...M determine the
approximate optimal estimator $\hat R(\hat\alpha^*)$.

In determining the optimal $\hat\alpha^*$, all conditional densities

\[ f_\ell(X_j), \qquad \ell=1,2,\ldots M,\; j=1,2 \ldots n \tag{4-46} \]

have been evaluated. Since the posterior estimator $\hat R(p)$ based on n
samples is a by-product of Algorithm A, and since the posterior estimator
has the smallest variance (theorem 3-1), we may as well incorporate it into
the approximated optimal estimator.


Thus, let the final estimator $\hat R(f)$ be given by

\[ \hat R(f) = \frac{n}{N}\,\hat R(p) + \frac{N-n}{N}\,\hat R(\hat\alpha^*) \tag{4-47} \]

where $\hat R(p)$ is the posterior estimator based on the n samples
$\{(X_1,\theta_1), \ldots, (X_n,\theta_n)\}$ used to determine $\hat\alpha^*$, and $\hat R(\hat\alpha^*)$ is the approximated
optimal estimator based on the remaining N-n samples
$\{(X_{n+1},\theta_{n+1}), \ldots, (X_N,\theta_N)\}$.

The final estimator $\hat R(f)$ requires M density evaluations for each
of the first n samples and an average of $C(\hat\alpha^*)$ for each of the remaining
N-n samples. Its variance is given by

\[ VAR\{\hat R(f)\} = \frac{n}{N^2}\,V(p) + \frac{N-n}{N^2}\,V(\hat\alpha^*). \tag{4-48} \]

Thus the computational efficiency $CE(f)$ of the estimator is

\[ CE(f) = \frac{1}{\left(nM + (N-n)C(\hat\alpha^*)\right)\left(\frac{n}{N^2}V(p) + \frac{N-n}{N^2}V(\hat\alpha^*)\right)}. \tag{4-49} \]

Assume that the n samples contain enough information so that the
approximated optimal estimator $\hat R(\hat\alpha^*)$ is close to the true optimal
estimator $\hat R(\alpha^*)$. That is, assume

\[ CE(\hat\alpha^*) = \frac{1}{V(\hat\alpha^*) \times C(\hat\alpha^*)} \approx \frac{1}{V(\alpha^*) \times C(\alpha^*)} = CE(\alpha^*). \tag{4-50} \]

Also, let us assume the approximation is close enough so that

\[ CE(\hat\alpha^*) \ge CE(p). \tag{4-51} \]

Then from (4-49) and (4-51),

\[ CE(\hat\alpha^*) \ge CE(f) \ge CE(p), \tag{4-52} \]


the lower bound $CE(p)$ for $CE(f)$ being obtained when n=N and the upper
bound $CE(\hat\alpha^*)$ when n=0. The best case would be when $\alpha^*$ were known
a priori and n=0.

The computational efficiency of the final estimator relative to
the posterior estimator, $RCE(f,p)$, is

\[ RCE(f,p) = \frac{M\,V(p)}{\left(nM + (N-n)C(\hat\alpha^*)\right)\left(\frac{n}{N^2}V(p) + \frac{N-n}{N^2}V(\hat\alpha^*)\right)} \tag{4-53} \]

which is greater than one provided (4-51) holds.
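The efficiency of the final estimator and its limiting cases can be checked numerically. The sketch below assumes that $V(\cdot)$ is a per-sample variance, that the final estimator weights its two parts by n/N and (N-n)/N, and that its cost is nM + (N-n)C; the function names are hypothetical. The numbers are the Example 1 values of V(p), $V(\alpha^*)$ and $C(\alpha^*)$ from the appendix, used here only for illustration.

```python
def final_variance(n, N, v_post, v_opt):
    """Variance of the final estimator: (n*V(p) + (N-n)*V(alpha*)) / N^2."""
    return (n * v_post + (N - n) * v_opt) / N**2


def final_cost(n, N, M, c_opt):
    """Density evaluations: M per sample for the first n, c_opt on average afterwards."""
    return n * M + (N - n) * c_opt


def ce_final(n, N, M, v_post, v_opt, c_opt):
    """Computational efficiency of the final estimator (inverse of cost * variance)."""
    return 1.0 / (final_cost(n, N, M, c_opt) * final_variance(n, N, v_post, v_opt))


def rce_final_vs_posterior(n, N, M, v_post, v_opt, c_opt):
    """Efficiency relative to the posterior, whose efficiency is 1 / (M * V(p))."""
    return M * v_post * ce_final(n, N, M, v_post, v_opt, c_opt)


M, v_post, v_opt, c_opt = 5, 0.01831, 0.01832, 2.6
ce_n0 = ce_final(0, 400, M, v_post, v_opt, c_opt)      # all samples in the optimal part
ce_nN = ce_final(400, 400, M, v_post, v_opt, c_opt)    # all samples in the posterior part
ce_n25 = ce_final(25, 400, M, v_post, v_opt, c_opt)
rce_n25 = rce_final_vs_posterior(25, 400, M, v_post, v_opt, c_opt)
```

Setting n=0 recovers the efficiency of the optimal estimator and n=N recovers that of the posterior, which are the two bounds discussed above; for these inputs the intermediate case n=25 falls between them.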

Let us now compare the final estimator to the posterior estimator
in terms of the number of density evaluations saved. As in section 4.4,
let $S(f,p)$ be the number of density evaluations one would save by using
the final estimator on N samples rather than the posterior based on $N_p$
samples, where $N_p$ is chosen so that both have the same accuracy (variance).

\[ S(f,p) = N_p M - \left(nM + (N-n)C(\hat\alpha^*)\right). \tag{4-54} \]

From theorem 3-1, we have

\[ V(p) \le V(\alpha^*). \tag{4-55} \]

Thus  from  (4-53),  (4-54)  and  (4-55), 

S(f,p)  N(M.xV.(r):v.(q')C(a  ))  _ (4-56) 

V(o  ) 

From  (4-56),  the  greatest  savings  result  when  N is  large  (an  accurate 
risk  estimate  is  desired)  and  n is  small.  However,  it  is  important  that 
n be  large  enough  to  closely  approximate  the  optimal  estimator  to  assure 
that  (4-51)  holds. 

It can be shown that if

\[ V(\alpha^*) \le R(1-R) \tag{4-57} \]

then the average savings in number of density evaluations by using the
final estimator rather than the error count estimator, $S(f,ec)$, for an
estimate of the same accuracy, is bounded below by

\[ S(f,ec) \ge (N-n)\left(M - C(\alpha^*)\right). \tag{4-58} \]

Of  course,  it  takes  a certain  amount  of  work,  over  and  above  the  Mn 
density  evaluations,  to  approximate  the  optimal  estimator  using  the  first 
n samples.  The  average  number  of  density  evaluations  saved  by  using  the 
final  estimator  rather  than  one  of  the  existing  estimators  must  be  compared 
to  this  overhead.  If  the  density  evaluations  are  costly  and  the  number 
saved  is  large,  a net  savings  in  work  should  be  realized. 
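The savings comparison can be made concrete with a short calculation. This sketch assumes per-sample variances, so that a competing estimator with per-sample variance v needs v / VAR{R(f)} samples to match the final estimator's accuracy, each costing M density evaluations. The function names are hypothetical. The inputs are Example 1 values from the appendix, except `c_opt = 2.8`, which is an illustrative figure back-computed from an efficiency of 19.49 and is not quoted in the text.

```python
def final_variance(n, N, v_post, v_opt):
    """Variance of the final estimator: (n*V(p) + (N-n)*V(alpha*)) / N^2."""
    return (n * v_post + (N - n) * v_opt) / N**2


def final_cost(n, N, M, c_opt):
    """Total density evaluations used by the final estimator."""
    return n * M + (N - n) * c_opt


def savings(n, N, M, v_other, v_post, v_opt, c_opt):
    """Evaluations saved versus an estimator with per-sample variance v_other.

    The competing estimator is given just enough samples to reach the same
    variance as the final estimator, at M density evaluations per sample.
    """
    n_other = v_other / final_variance(n, N, v_post, v_opt)
    return n_other * M - final_cost(n, N, M, c_opt)


M, n, N = 5, 25, 400
v_post, v_opt, c_opt = 0.01831, 0.01832, 2.8   # c_opt is an assumed value
v_ec = 0.22926                                  # R(1-R), error count variance

saved_vs_posterior = savings(n, N, M, v_post, v_post, v_opt, c_opt)
saved_vs_error_count = savings(n, N, M, v_ec, v_post, v_opt, c_opt)
lower_bound_ec = (N - n) * (M - c_opt)
```

Under these assumptions the savings come out on the order of eight hundred evaluations against the posterior and tens of thousands against the error count, the same order of magnitude as the figures tabulated for Example 1; whether this outweighs the overhead of approximating the optimal estimator depends on how costly one density evaluation is.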

4.6 Examples

Consider the five Gaussian classes of example 1 described on page A-1
of the appendix. Page A-2 gives for these densities the values $d_{r_i s_i}$,
i=0,1...10, their correspondence with $\alpha_{t_i}$, i=0,1...10 and $\alpha_{q_i}$, i=0,1...5,
and the resulting sets $T_m(\alpha_{t_i})$, $Q_m(\alpha_{t_i})$, m=1,2...5, i=0,1...10. The
computational efficiency $CE(\alpha_{t_i})$, i=0,1...10 is given on page A-3. The
computational efficiency is maximized for $\alpha^* = \alpha_{t_6}$ and $CE(\alpha^*) = 20.99$. From
page A-2, we have $T_1(\alpha^*) = \{1,2\} = T_2(\alpha^*)$, $T_3(\alpha^*) = \{3,4,5\} = T_4(\alpha^*) = T_5(\alpha^*)$,
and $Q_i(\alpha^*) = T_i(\alpha^*)$, i=1,2...5. These sets represent the natural
clustering of the classes.

On page A-6, the computational efficiency of the optimal estimator
relative to the posterior is given by $RCE(\alpha^*,p) = 1.92$. Thus to achieve
the same accuracy, the posterior estimator would require on the average
1.92 times the number of density evaluations required by the optimal
estimator. The computational efficiency of the optimal estimator relative
to the error count estimator is $RCE(\alpha^*,ec) = 24.07$, thus the error count
would require 24.07 times the computation of the optimal for the same
accuracy.

Also on page A-6 are the average number of density evaluations saved
by using the optimal estimator, rather than the error count or posterior,
for various sample sizes N* for the optimal estimator. For example, if
the optimal estimator is formed using N*=400 samples, in order to achieve
the same accuracy the posterior would require on the average 956 more
density evaluations and the error count 23,992 more.
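These figures can be reproduced approximately from the tabulated per-sample quantities. The sketch assumes that relative computational efficiency is the ratio of the two efficiencies, that the posterior and error count estimators both use M=5 density evaluations per sample, and that the error count has per-sample variance .22926 (the analytic value noted on page A-3). Variable and function names are hypothetical.

```python
M = 5
v_post = 0.01831   # V at alpha_t0 (posterior), page A-3
v_opt = 0.01832    # V at alpha* = alpha_t6, page A-3
c_opt = 2.6        # C at alpha*, page A-3
v_ec = 0.22926     # analytic error count variance, page A-3 footnote

ce_opt = 1.0 / (v_opt * c_opt)
ce_post = 1.0 / (v_post * M)
ce_ec = 1.0 / (v_ec * M)

rce_vs_posterior = ce_opt / ce_post
rce_vs_error_count = ce_opt / ce_ec


def saved(n_star, v_other):
    """Evaluations saved when the optimal estimator uses n_star samples.

    The other estimator needs n_star * v_other / v_opt samples for equal
    variance, at M evaluations each; the optimal one uses n_star * c_opt.
    """
    return n_star * (M * v_other / v_opt - c_opt)


saved_post_400 = saved(400, v_post)
saved_ec_400 = saved(400, v_ec)
```

The ratios agree with the quoted 1.92 and 24.07 to two decimals. The computed savings differ from the tabulated 956 and 23,992 by well under one percent, a gap that is plausibly due to rounding of intermediate values in the original tables.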

Page A-7 gives the approximated values $\hat d_{r_i s_i}$, i=0,1...10, their
correspondence with $\hat\alpha_{t_i}$, i=0,1...10 and $\hat\alpha_{q_i}$, i=0,1...5 and the resulting
sets $\hat T_m(\hat\alpha_{t_i})$, $\hat Q_m(\hat\alpha_{t_i})$, m=1,2...5, i=0,1...10, where approximations are
based on n=25 samples. Comparison of the approximated values with the
true values given on page A-2 shows that for $i \ge 6$, $\hat T_m(\hat\alpha_{t_i}) = T_m(\alpha_{t_i})$ for
all m=1,2...M. Discrepancies for i < 6 are caused by the fact that in
the approximation, classes 1 and 4 were split before classes 2 and 5
and classes 2 and 3 before classes 1 and 3.

Although the approximated values $\hat\alpha_{t_i}$, i=0,1...10 do not seem close
to the true values $\alpha_{t_i}$, i=0,1...10, the computational efficiency $CE(\hat\alpha_{t_i})$
for the approximated values (page A-8) is only slightly less than the
computational efficiency $CE(\alpha_{t_i})$ for the true values (page A-3). Thus
if the sets $\hat T_m(\hat\alpha_{t_i}) = T_m(\alpha_{t_i})$, m=1,2...5, the estimator $\hat R(\hat\alpha_{t_i})$ is
equivalent to the estimator $\hat R(\alpha_{t_i})$ and is almost as efficient.

i 

From page A-8, we see that $\hat\alpha^* = \hat\alpha_{t_6}$ maximizes the approximate
computational efficiency $\widehat{CE}$. Since $\hat T_m(\hat\alpha^*) = T_m(\alpha^*)$, m=1,2...5, and
$CE(\hat\alpha^*) = 19.49 \approx CE(\alpha^*) = 20.99$, the approximate optimal estimator $\hat R(\hat\alpha^*)$
is almost as efficient as the true optimal estimator $\hat R(\alpha^*)$.

On page A-9, the final estimator is compared with the posterior and
error count estimators. The final estimator is formed as the posterior
estimator $\hat R(p)$ on n=25 samples and the approximated optimal $\hat R(\hat\alpha^*)$ on the
remaining N-25 samples. If the final estimator uses a total of N=400
samples, the computational efficiency of the final estimator relative to
the posterior estimator is 1.71. By using the final estimator rather
than the posterior, on the average 828.93 density evaluations have been
saved in obtaining an equally accurate estimate of the risk. Thus if the
work involved in approximating the optimal $\alpha$ with n=25 samples is less
than the work involved in evaluating 828.93 densities, the final estimator
is preferable.

Compared  to  the  error  count  estimator,  if  the  final  estimator  uses 

N=400  samples,  on  the  average  23,887.05  density  evaluations  are  saved. 

If  density  evaluations  are  costly,  one  would  almost  surely  prefer  the 

final  estimator  to  the  error  count. 

Next, consider the five densities described on page B-1. The values
$d_{r_i s_i}$, their correspondence with the points $\alpha_{t_i}$, i=0,1...8 and $\alpha_{q_i}$,
i=0,1...3 and the resulting sets $T_m(\alpha_{t_i})$ and $Q_m(\alpha_{t_i})$, m=1,2...M, i=0,1...8
are given on page B-2. Page B-3 lists the values $C(\alpha_{t_i})$, $V(\alpha_{t_i})$ and
$CE(\alpha_{t_i})$, i=0,1...8. The value $\alpha^* = \alpha_{t_8}$ maximizes the computational efficiency
$CE(\alpha)$ and $CE(\alpha^*) = 17.87$. On page B-5, the optimal estimator is compared
with the error count and posterior. The computational efficiency of the
optimal estimator relative to the posterior is 1.95 and relative to the
error count is 15.97. The numbers of density evaluations saved by using
the optimal estimator rather than either of the existing estimators are
listed on page B-5 for various sample sizes for the optimal estimator.

On page B-6 are given the points $\hat d_{r_i s_i}$, i=0,1...8, their correspondence
with $\hat\alpha_{t_i}$, i=0,1...8 and $\hat\alpha_{q_i}$, i=0,1...3 and the sets $\hat T_m(\hat\alpha_{t_i})$ and
$\hat Q_m(\hat\alpha_{t_i})$, m=1,2...5, i=0,1...8. Note that $\hat T_m(\hat\alpha_{t_i}) = T_m(\alpha_{t_i})$, m=1,2...5,
i=0,1...8.


On page B-7, we see that $\hat\alpha^* = \hat\alpha_{t_8}$ maximizes the approximate computational
efficiency $\widehat{CE}$. Thus the approximate optimal estimator $\hat R(\hat\alpha^*)$ is equivalent
to the true optimal estimator $\hat R(\alpha^*)$ and its computational efficiency
$CE(\hat\alpha^*) = 17.51$ is only slightly less than the computational efficiency
of the optimal estimator $CE(\alpha^*) = 17.87$.

Again, the final estimator formed by the posterior $\hat R(p)$ on n=25
samples and the approximate optimal $\hat R(\hat\alpha^*)$ on the remaining N-25 samples
compares favorably to the error count and posterior estimators. Page B-8
makes comparisons in terms of relative computational efficiency and saved
density evaluations for various sample sizes N for the final estimator.
Thus if the final estimator is based on N=400 samples, its computational
efficiency relative to the posterior is 1.76, and by using the final
estimator rather than the posterior, on the average 665 density evaluations
would be saved for an equally accurate estimate of the risk. The
computational efficiency of the final estimator relative to the error count
estimator is 14.38 when the final estimator uses a sample size of N=400.
In order to obtain the same accuracy with the error count estimator, on
the average 11,707.5 more density evaluations would be required.
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CHAPTER  5 
CONCLUSIONS 

5.1 Summary of Results

In  this  thesis  we  have  studied  estimators  for  the  Bayes  risk  in 
terms  of  the  amount  of  computation  they  require  and  their  accuracy. 

The  existing  estimators  for  the  risk,  namely  the  error  count  and  the 
posterior,  were  shown  to  be  inadequate  computationally,  thus  several 
new estimators for the Bayes risk have been proposed. In particular,
a family of estimators, indexed on a scalar parameter $\alpha$, was defined
in such a way that estimators in the family in general required less
computation than the existing estimators. The optimal estimator was chosen
as that estimator in the family with maximum computational efficiency,
and had the property of requiring the least amount of computation for
a given accuracy.

In  estimation  of  the  Bayes  risk,  point  evaluations  of  the  class 
conditional  densities  are,  for  many  problems,  the  single  most  important 
factor  contributing  to  the  computational  effort.  For  this  reason,  the 
amount of computation required for a given Bayes risk estimator was
defined as the number of density evaluations involved in the estimation
procedure. The existing estimators, the error count estimator and the
posterior estimator, require for each sample $X_j$, j=1,2...N, evaluation of
the class conditional density $f_\ell(X_j)$ for each class $\ell$=1,2...M, a total
of N×M density evaluations. Thus when the number of classes M is large
or the number of samples N is large (an accurate estimate of the risk is
desired), the existing estimators were seen to be impractical from a
computational aspect.
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In  searching  for  an  estimator  for  the  Bayes  risk  which  could  be 

computed  with  fewer  density  evaluations  per  sample,  a general  form  R(T) 

for  Bayes  risk  estimators  was  discovered.  An  estimator  of  the  form  R(T) 

was defined by associating with each class m some set of classes $T_m$.

When  the  number  of  classes  M=2,  the  class  of  estimators  of  the  general 

form  R(T)  consisted  of  the  two  existing  estimators,  the  error  count 

estimator  and  the  posterior  estimator.  For  more  than  two  classes,  the 

class  of  estimators  of  the  general  form  contained  several  new  estimators 

for  the  Bayes  risk,  in  addition  to  the  existing  estimators. 

By restricting the set of classes associated with each class m to
be those classes $T_m(\alpha)$ that are "$\alpha$-close" to class m, an estimator $\hat R(\alpha)$
of the general form was defined which in general required fewer density
evaluations to compute. As the scalar parameter $\alpha$ varied, the sets of
classes $T_1(\alpha), \ldots, T_M(\alpha)$ varied and a family $\{\hat R(\alpha) : 0 \le \alpha < \alpha_{max}\}$ of Bayes
risk estimators was achieved. Estimators in the family were characterized
by the average number of conditional density evaluations needed to
compute them and by their variance.

The optimal estimator $\hat R(\alpha^*)$ from the family was defined as that
estimator with maximum computational efficiency, where the computational
efficiency of an estimator was defined as the inverse of the product of
its variance and the average number of density evaluations it required.
It was shown that the optimal estimator $\hat R(\alpha^*)$ required the least amount
of computation to achieve a given accuracy, or, symmetrically, achieved
the greatest accuracy for a fixed amount of computation.

It was pointed out that in practice, the optimal estimator $\hat R(\alpha^*)$
could not be determined by maximizing the computational efficiency,
since this in effect would require knowledge of the true risk R. Thus
a method was proposed whereby a subset n of the total N samples was used
to approximate the optimal estimator. The n samples should contain
enough information on the closeness of the classes to determine an almost
optimal estimator. The remaining N-n samples would be used in the
approximate optimal estimator to obtain an accurate estimate of the risk
with a minimum of computation.

For both examples given in the appendix, the optimal estimator was
closely approximated using n=25 samples. In fact, for each case the
approximate optimal estimator was equivalent to the true optimal estimator,
in the sense that point estimates of the risk resulting from either
would be identical. However, the approximate optimal estimator would,
on the average, perform slightly more density evaluations in forming the
estimate.

The technique for approximating the optimal estimator forms as a
by-product the posterior estimator based on n samples. Because the posterior
estimator has minimum variance among all estimators considered here, we
defined as our final estimator the posterior estimator on the n samples
used in approximating the optimal estimator and the approximated optimal
estimator on the remaining N-n samples.

The final estimator was compared to the error count and posterior
estimators. The comparisons were based on the number of density evaluations
that would be saved by using the final estimator rather than one
of the existing estimators in obtaining equally accurate estimates of
the risk. Situations for which great computational savings would be
expected were the following:

1. an accurate estimate of the risk is desired, thus a large
number N of samples is used.

2. the number of classes M is large and the classes tend to form
several small clusters.

5.2 Recommendations for Further Research

In section 3.2.5, variances for estimators in the family
$\{\hat R(\alpha) : 0 \le \alpha < \alpha_{max}\}$ based on unrestricted sampling were derived. We
discussed properties of these variances and indicated reasons for
believing that as $\alpha$ increased, the variance of the estimator $\hat R(\alpha)$ should
not decrease. Thus the question: under what conditions does the following
proposition hold?

Proposition 1.

If $0 \le \alpha_1 \le \alpha_2 < \alpha_{max}$

then $V(\alpha_1) \le V(\alpha_2)$.
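The tabulated variances for the two appendix examples can be checked against the proposition. The sketch below only tests the listed values, which are estimates from 600 samples, so agreement is consistent with the proposition but proves nothing. The helper name is hypothetical, and the last Example 1 entry uses the analytic value .22926 from the page A-3 footnote because the tabulated estimate is not legible in the source.

```python
def is_non_decreasing(values):
    """True when each value is at least as large as the one before it."""
    return all(a <= b for a, b in zip(values, values[1:]))


# V(alpha_t0), ..., V(alpha_t10) for Example 1 (page A-3).
v_example1 = [0.01831] * 6 + [0.01832, 0.02425, 0.07588, 0.14358, 0.22926]

# V(alpha_t0), ..., V(alpha_t8) for Example 2 (page B-3).
v_example2 = [0.02184] * 6 + [0.02205, 0.0224, 0.02884]

example1_ok = is_non_decreasing(v_example1)
example2_ok = is_non_decreasing(v_example2)
```

Both sequences are non-decreasing in $\alpha$, so neither example contradicts the proposition.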

Aside from a theoretical interest, proposition 1 has the practical
consequence indicated in corollary 4-2: the computational
efficiency could be maximized over the J+1 values $\alpha_{q_0}, \ldots, \alpha_{q_J}$ rather than
over the larger number K+1 of values $\alpha_{t_0}, \ldots, \alpha_{t_K}$. Thus to determine the
optimal estimator for example 1 in the appendix, the computational
efficiency need only be evaluated at J + 1 = 6 points rather than K + 1 = 11
points, and at J + 1 = 4 points rather than K + 1 = 9 points for example
2.
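For example 1 the reduction can be seen directly from the tabulated efficiencies. In this sketch the list holds $CE(\alpha_{t_i})$, i=0,1...10 from page A-3, and `q_points` lists the indices i at which the table pairs $\alpha_{t_i}$ with some $\alpha_{q_j}$; that pairing is read off the appendix table and should be treated as an assumption, as should the variable names.

```python
# CE(alpha_t0), ..., CE(alpha_t10) for Example 1 (page A-3).
ce_by_t = [10.92, 10.92, 10.92, 10.92, 10.92, 13.00, 20.99, 13.26, 4.05, 1.92, 1.14]

# Indices i whose alpha_t_i coincides with alpha_q0, ..., alpha_q5.
q_points = [0, 5, 6, 8, 9, 10]

# Maximize over all K+1 = 11 points, then over only the J+1 = 6 points.
best_over_all = max(range(len(ce_by_t)), key=ce_by_t.__getitem__)
best_over_q = max(q_points, key=ce_by_t.__getitem__)
```

Both searches select i=6, so in this example restricting the search to the $\alpha_q$ points loses nothing while nearly halving the number of efficiency evaluations.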

The same question of non-decreasing variances arises in the family
$\{S\hat R(\alpha) : 0 \le \alpha < \alpha_{max}\}$ of risk estimators based on stratified sampling.
Moore, Whitsitt and Landgrebe's example [30], in which the stratified
error count estimator had smaller variance than the stratified posterior
estimator, shows that an analog to theorem 3-1 is not possible for
stratified sampling. However, for their example, the error count estimator
would not be included in the family. Thus one might still hope to show
that the variances of estimators in the family $\{S\hat R(\alpha) : 0 \le \alpha < \alpha_{max}\}$ are
non-decreasing as $\alpha$ increases.

Finally, concerning the choice of the number of samples $n_i$ from each
class i=1,...M to be used in the stratified estimator $S\hat R(\alpha)$, we discussed
two choices: the heuristic one with $n_i$ proportional to the prior probability
of class i, and the choice of $n_i$ to minimize the variance of the estimator
$S\hat R(\alpha)$. In view of chapter 4, a better choice would have been to choose
$n_i$ to maximize the computational efficiency of the estimator $S\hat R(\alpha)$.
What implications would this have, in terms of the optimal estimator for
stratified sampling, and could this be useful in practice?
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A-3 

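Example 1. Values of $C(\alpha_{t_i})$, $V(\alpha_{t_i})$ and $CE(\alpha_{t_i})$

$\alpha$     $\alpha_{t_i}$       $\alpha_{q_i}$    $C(\alpha)$     $V^*(\alpha)$     $CE(\alpha)$

0            $\alpha_{t_0}$       $\alpha_{q_0}$    5.00            .01831            10.92
.000001      $\alpha_{t_1}$                         5.00            .01831            10.92
.000005      $\alpha_{t_2}$                         5.0             .01831            10.92
.000027      $\alpha_{t_3}$                         5.0             .01831            10.92
.000122      $\alpha_{t_4}$                         5.0             .01831            10.92
.000174      $\alpha_{t_5}$       $\alpha_{q_1}$    (illegible)     .01831            13.00
.00062       $\alpha_{t_6}$       $\alpha_{q_2}$    2.6             .01832            20.99
.03652       $\alpha_{t_7}$                         3.11            .02425            13.26
.06022       $\alpha_{t_8}$       $\alpha_{q_3}$    3.25            .07588            4.05
.07042       $\alpha_{t_9}$       $\alpha_{q_4}$    3.62            .14358            1.92
.0744        $\alpha_{t_{10}}$    $\alpha_{q_5}$    3.83            (illegible)**     1.14

*Estimates based on 600 samples
**Analytic value is .22926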
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Example  1.  Matrix  of  covariances  [C^^or)]  for  a = a , a , a 

o 8 10 


“t 

o 

(posterior) 

.582 

-.127 

.520 

-.122 

-.103 

.35 

-.215 

-.181 

-.088 

.588 

-.077 

-.165 

.166 

-.071 

.185 

.582 

-.127 

-.124 

-.222 

-.184 

O' 

.520 

-.105 

-.187 

-.071 

C8 

.456 

-.049 

-.069 

1 .429 

-.124 

1.038 

1.843 

-.107 

-.114 

-.247 

-.087 

«t 

1.265 

-.076 

-.165 

-.058 

10 

(error  count) 

1.339 

-.175 

-.062 

2.708 

-.134 

1.038 

1.644 

-.126 

-.110 

-.189 

-.080 

a 

1.644 

-.110 

-.190 

-.080 

t10 

(analytic) 

1.45 

-.166 

-.070 

2.39 

-.121 

1.08 

♦Estimates  based  on  600  samples. 
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Example  1.  Estimators  R(a  ) i =0,1... 10  For  Various  Sample  Sizes 

                                                    N=50               N=100              N=200              true
$\alpha$      $\alpha_{t_i}$       $\alpha_{q_i}$    $\hat R(\alpha)$   $\hat R(\alpha)$   $\hat R(\alpha)$   R

* 0           $\alpha_{t_0}$       $\alpha_{q_0}$    .36047             .35832             .36322             .356
.000001       $\alpha_{t_1}$                         .36047             .35832             .36322             .356
.000005       $\alpha_{t_2}$                         .36047             .35832             .36322             .356
.000027       $\alpha_{t_3}$                         .36047             .35832             .36322             .356
.000122       $\alpha_{t_4}$                         .36047             .35832             .36322             .356
.000174       $\alpha_{t_5}$       $\alpha_{q_1}$    .36047             .35833             .36323             .356
.00062        $\alpha_{t_6}$       $\alpha_{q_2}$    .36045             .35829             .36318             .356
.03652        $\alpha_{t_7}$                         .37872             .36578             .37265             .356
.06022        $\alpha_{t_8}$       $\alpha_{q_3}$    .33586             .38022             .37872             .356
.07042        $\alpha_{t_9}$       $\alpha_{q_4}$    .34302             .37835             .38116             .356
** .0744      $\alpha_{t_{10}}$    $\alpha_{q_5}$    .28                .35                .365               .356

* $\hat R(\alpha_{t_0})$ is equivalent to the posterior estimator $\hat R(p)$.

** $\hat R(\alpha_{t_{10}})$ is equivalent to the error count estimator $\hat R(ec)$.
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.Example  1 

Computational  efficiency  of  the  optimal  estimator  relative  to  the 
posterior  estimator  and  the  number  of  density  evaluations  saved  by  using 
the  optimal  estimator  rather  than  the  posterior,  for  various  sample 
sizes  N*  for  the  optimal  estimator. 


N*       $RCE(\alpha^*,p)$     $S(\alpha^*,p)$

100      1.92                  239
200      1.92                  478
400      1.92                  956
600      1.92                  1,434

Computational  efficiency  of  the  optimal  estimator  relative  to  the 
error  count  estimator  and  the  number  of  density  evaluations  saved  by  using 
the  optimal  estimator  rather  than  the  error  count,  for  various  sample 
sizes  N*  for  the  optimal  estimator. 


N*       $RCE(\alpha^*,ec)$    $S(\alpha^*,ec)$

100      24.07                 5,998
200      24.07                 11,996
400      24.07                 23,992
600      24.07                 35,988


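Example 1. Values of $\hat d_{r_i s_i}$, $\hat\alpha_{t_i}$, $\hat\alpha_{q_i}$, $\hat T_m(\hat\alpha_{t_i})$ and $\hat Q_m(\hat\alpha_{t_i})$ Approximated On the Basis of n=25 Samples.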
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Example  1.  Values  of  C(at  ),  V(<?t  ) and  C£(<?t  ) Approximated  On  the 

1 i i 

Basis  of  n=25  Samples.  Also  C£(c*t  ). 


a 


a 


or. 


O' 


.01973 


10. 14 


10.92 


.000306 


t . 
l 


.01973 


10.14 


10.92 


.014804 


5. 


.01973 


10.14 


10.92 


.0003063 


or. 


5. 


.01973 


10.14 


10.92 


.014804 


.049186 


or. 


5. 


.01973 


10.14 


Of 


4.48 


.01973 


11  .31 


10.92 


12.24 


.017754 


5 j 2.78 


.01973 


18.23 


19.49 


.04447 


a. 


3.19 


.03124 


10.03 


12.61 


.06319 


or 


3.56 


.06121 


4.59 


3.88 


.073341 


3.96 


11432 


.23 


1.81 


.074388 


4.26 


,16667 


1.41 


10 


1.14 


L. 
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Example  1. 

Computational efficiency of the final estimator relative to the
posterior estimator and the number of density evaluations saved by using
the final estimator rather than the posterior, when the optimal estimator
is approximated on the basis of n=25 samples, for various sample
sizes N for the final estimator.

N        $RCE(f,p)$     $S(f,p)$

100      1.5            166.75
200      1.64           391.36
400      1.71           828.93
600      1.74           1,275.39

Computational  efficiency  of  the  final  estimator  relative  to  the 
error  count  estimator  and  the  number  of  density  evaluations  saved  by  using 
the  final  estimator  rather  than  the  error  count,  when  the  optimal  estimator 
is  approximated  on  the  basis  of  n=25  samples,  for  various  sample  sizes 
N for  the  final  estimator. 


N        $RCE(f,ec)$    $S(f,ec)$

100      18.78          5,929.63
200      20.49          11,918.14
400      21.46          23,887.05
600      21.81          35,866.04
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Example  2.  Values  of  C(ot  ), 


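$V(\alpha_{t_i})$ and $CE(\alpha_{t_i})$

$\alpha$        $\alpha_{t_i}$     $\alpha_{q_i}$    $C(\alpha)$     $V^*(\alpha)$    $CE(\alpha)$

0               $\alpha_{t_0}$     $\alpha_{q_0}$    5               .02184           9.16
.000000021      $\alpha_{t_1}$                       5               .02184           9.16
.000000105      $\alpha_{t_2}$                       5               .02184           9.16
.0000002        $\alpha_{t_3}$                       5               .02184           9.16
.0000012        $\alpha_{t_4}$                       5               .02184           9.16
.0006226        $\alpha_{t_5}$     $\alpha_{q_1}$    (illegible)     .02184           10.41
.0022214        $\alpha_{t_6}$     $\alpha_{q_2}$    2.62            .02205           17.31
.003592         $\alpha_{t_7}$                       2.66            .0224            16.78
.0105237        $\alpha_{t_8}$     $\alpha_{q_3}$    1.94            .02884           17.87

*Estimates based on 600 samples.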


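Example 2. Estimators $\hat R(\alpha_{t_i})$, i=0,1...8 and $\hat R(ec)$ for Various Sample Sizes

                          N=50               N=100              N=200              true
Estimator                 $\hat R(\alpha)$   $\hat R(\alpha)$   $\hat R(\alpha)$   R

* $\hat R(\alpha_{t_0})$  .21929             .22522             .22067             .233
$\hat R(\alpha_{t_1})$    .21929             .22522             .22067             .233
$\hat R(\alpha_{t_2})$    .21929             .22522             .22067             .233
$\hat R(\alpha_{t_3})$    .21929             .22522             .22067             .233
$\hat R(\alpha_{t_4})$    .21929             .22522             .22067             .233
$\hat R(\alpha_{t_5})$    .21932             .22537             .22078             .233
$\hat R(\alpha_{t_6})$    .21911             .22414             .21977             .233
$\hat R(\alpha_{t_7})$    .21915             .22590             .22093             .233
$\hat R(\alpha_{t_8})$    .19653             .21974             .21861             .233
** $\hat R(ec)$           .16                .22                .2

* $\hat R(\alpha_{t_0})$ is equivalent to the posterior estimator $\hat R(p)$.

** $\hat R(ec)$ is not allowed in the family.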

B-5 


Example  2. 

Computational  efficiency  of  the  optimal  estimator  relative  to  the 
posterior  estimator  and  the  number  of  density  evaluations  saved  by  using 
the  optimal  estimator  rather  than  the  posterior  for  various  sample  sizes 
for  the  optimal  estimator. 


N*       $RCE(\alpha^*,p)$     $S(\alpha^*,p)$

100      1.95                  184
200      1.95                  368
400      1.95                  736
600      1.95                  1,104

Computational efficiency of the optimal estimator relative to the
error count estimator and the number of density evaluations saved by using
the optimal estimator rather than the error count, for various sample sizes
N* for the optimal estimator.

N*       $RCE(\alpha^*,ec)$    $S(\alpha^*,ec)$

100      15.97                 2,904
200      15.97                 5,808
400      15.97                 11,616
600      15.97                 17,424
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Example  2.  Values  of  C (c? t ),  V0*t  ) and  Cfc(5t  ) Approximated  On  the 


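Basis of n=25 samples. Also $CE(\hat\alpha_{t_i})$.

$\hat\alpha$    $\hat\alpha_{t_i}$    $\hat\alpha_{q_i}$    $\hat C(\hat\alpha)$    $\hat V(\hat\alpha)$    $\widehat{CE}(\hat\alpha)$    $CE(\hat\alpha)$

.0              $\hat\alpha_{t_0}$    $\hat\alpha_{q_0}$    5.                      .01994                  10.03                         9.16
.0              $\hat\alpha_{t_1}$                          5.                      .01994                  10.03                         9.16
.0000001        $\hat\alpha_{t_2}$                          5.                      .01994                  10.03                         9.16
.0000002        $\hat\alpha_{t_3}$                          5.                      .01994                  10.03                         9.16
.0000023        $\hat\alpha_{t_4}$                          5.                      .01994                  10.03                         9.16
.0157484        $\hat\alpha_{t_5}$    $\hat\alpha_{q_1}$    4.46                    .01994                  11.24                         10.29
.0447042        $\hat\alpha_{t_6}$    $\hat\alpha_{q_2}$    3.24                    .01995                  15.47                         13.22
.0124001        $\hat\alpha_{t_7}$                          2.72                    .01921                  19.14                         16.35
.0124051        $\hat\alpha_{t_8}$    $\hat\alpha_{q_3}$    2.00                    .02068                  24.18                         17.51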
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Example  2. 

Computational  efficiency  of  the  final  estimator  relative  to  the 
posterior  estimator  and  the  number  of  density  evaluations  saved  by  using 
the  final  estimator  rather  than  the  posterior,  when  the  optimal  estimator 
is  approximated  on  the  basis  of  n=25  samples,  for  various  sample  sizes 
N for  the  final  estimator. 


N        $RCE(f,p)$     $S(f,p)$

100      1.46           126.5
200      1.64           304
400      1.76           665
600      1.8            1,020

Computational efficiency of the final estimator relative to the
error count estimator and the number of density evaluations saved by using
the final estimator rather than the error count, when the optimal estimator
is approximated on the basis of n=25 samples, for various sample
sizes N for the final estimator.

N        $RCE(f,ec)$    $S(f,ec)$

100      11.99          3022.25
200      13.45          5913.75
400      14.33          11707.5
600      14.73          17505.75


